Abstract In this paper, we consider a class of finite-sum convex optimization problems whose objective function
is given by the summation of $m$ ($> 1$) smooth components together with some other relatively simple terms.
We first introduce a deterministic primal-dual gradient (PDG) method that can achieve the optimal black-box
iteration complexity for solving these composite optimization problems using a primal-dual termination
criterion. Our major contribution is to develop a randomized primal-dual gradient (RPDG) method, which needs
to compute the gradient of only one randomly selected smooth component at each iteration, but can possibly
achieve better complexity than PDG in terms of the total number of gradient evaluations. More specifically, we
show that the total number of gradient evaluations performed by RPDG can be $\mathcal{O}(\sqrt{m})$ times smaller, both in
expectation and with high probability, than those performed by deterministic optimal first-order methods under
favorable situations. We also show that the complexity of the RPDG method is not improvable by developing a
new lower complexity bound for a general class of randomized methods for solving large-scale finite-sum convex
optimization problems. Moreover, through the development of PDG and RPDG, we introduce a novel game-theoretic
interpretation for these optimal methods for convex optimization.

Keywords: convex programming, complexity, incremental gradient, primal-dual gradient method, Nesterov's
method, data analysis

1 Introduction

The basic problem of interest in this paper is the convex programming (CP) problem given by

$$\Psi^* := \min_{x \in X} \Big\{ \Psi(x) := \sum_{i=1}^m f_i(x) + h(x) + \mu\,\omega(x) \Big\}. \tag{1.1}$$

Here, $X \subseteq \mathbb{R}^n$ is a closed convex set, $h$ is a relatively simple convex function, $f_i : \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, m$, are
smooth convex functions with Lipschitz continuous gradient, i.e., $\exists L_i \ge 0$ such that

$$\|\nabla f_i(x_1) - \nabla f_i(x_2)\|_* \le L_i \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^n, \tag{1.2}$$

$\omega : X \to \mathbb{R}$ is a strongly convex function with modulus 1 w.r.t. an arbitrary norm $\|\cdot\|$, i.e.,

$$\langle \omega'(x_1) - \omega'(x_2), x_1 - x_2 \rangle \ge \tfrac{1}{2}\|x_1 - x_2\|^2, \quad \forall x_1, x_2 \in X, \tag{1.3}$$

and $\mu \ge 0$ is a given constant. Hence, the objective function $\Psi$ is strongly convex whenever $\mu > 0$. For notational
convenience, we also denote $f(x) \equiv \sum_{i=1}^m f_i(x)$ and $L \equiv \sum_{i=1}^m L_i$. It is easy to see that for some $L_f \ge 0$,

$$\|\nabla f(x_1) - \nabla f(x_2)\|_* \le L_f \|x_1 - x_2\| \le L \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^n. \tag{1.4}$$

Throughout this paper, we assume subproblems of the form

$$\arg\min_{x \in X} \; \langle g, x \rangle + h(x) + \mu\,\omega(x) \tag{1.5}$$

are easy to solve. CP given in the form of (1.1) has recently found a wide range of applications in machine
learning, statistics, and image processing, and hence has become the subject of intensive study during the past few
years.

Stochastic (sub)gradient descent (SGD) (a.k.a. stochastic approximation (SA)) type methods have been
proven useful to solve problems given in the form of (1.1). SGD was originally designed to solve stochastic
optimization problems given by

$$\min_{x \in X} \mathbb{E}_\xi[F(x, \xi)], \tag{1.6}$$

where $\xi$ is a random variable with support $\Xi \subseteq \mathbb{R}^d$. Problem (1.1) can be viewed as a special case of (1.6)
by setting $\xi$ to be a discrete random variable supported on $\{1, \ldots, m\}$ with $\mathrm{Prob}\{\xi = i\} = \nu_i$ and
$F(x, i) = \nu_i^{-1} f_i(x) + h(x) + \mu\,\omega(x)$, $i = 1, \ldots, m$. Since each iteration of SGDs needs to compute the (sub)gradient of
only one randomly selected $f_i$¹, their iteration cost is significantly smaller than that for deterministic first-order
methods (FOM), which involves the computation of first-order information of $f$ and thus all the $m$ (sub)gradients
of $f_i$'s. Moreover, when $f_i$'s are general nonsmooth convex functions, by properly specifying the probabilities $\nu_i$,
$i = 1, \ldots, m$², it can be shown (see [23]) that the iteration complexities for both SGD and FOM are in the same
order of magnitude. Consequently, the total number of subgradients required by SGDs can be $m$ times smaller
than those by FOMs.
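
The reformulation of the finite-sum problem as a stochastic program can be sketched numerically as follows. The snippet checks that sampling one component with probability $\nu_i$ and rescaling its gradient by $1/\nu_i$ gives an estimator whose expectation is the full gradient of $f$. The least-squares components, the helper names (`grad_fi`, `full_gradient`, `expected_sampled_gradient`, `sampled_gradient`) and the uniform choice $\nu_i = 1/m$ are illustrative assumptions and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))   # hypothetical data: f_i(x) = 0.5 * (a_i^T x - b_i)^2
b = rng.standard_normal(m)

def grad_fi(x, i):
    """Gradient of the i-th smooth component f_i."""
    return (A[i] @ x - b[i]) * A[i]

def full_gradient(x):
    """Gradient of f = sum_i f_i; costs m component gradients."""
    return sum(grad_fi(x, i) for i in range(m))

def expected_sampled_gradient(x, nu):
    """Exact expectation of grad_fi(x, i) / nu[i] when Prob{i} = nu[i]."""
    return sum(nu[i] * (grad_fi(x, i) / nu[i]) for i in range(m))

def sampled_gradient(x, nu):
    """One stochastic gradient: a single component gradient rescaled by 1/nu_i."""
    i = rng.choice(m, p=nu)
    return grad_fi(x, i) / nu[i]

x = rng.standard_normal(n)
nu = np.full(m, 1.0 / m)          # uniform sampling, assumed for illustration
```

Each call to `sampled_gradient` touches one component only, which is the source of the per-iteration saving discussed above; the price is the variance of the estimator.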

Note, however, that there is a significant gap in the complexity bounds between SGDs and deterministic
FOMs if $f_i$'s are smooth convex functions. For the sake of simplicity, let us focus on the strongly convex case
when $\mu > 0$ and let $x^*$ be the optimal solution of (1.1). In order to find a solution $\bar{x} \in X$ s.t. $\|\bar{x} - x^*\|^2 \le \epsilon$, the
total number of gradient evaluations for $f_i$'s performed by optimal FOMs can be bounded by

$$\mathcal{O}\Big\{ m \sqrt{\tfrac{L_f}{\mu}} \log \tfrac{1}{\epsilon} \Big\}, \tag{1.7}$$

which was first achieved by the well-known Nesterov's accelerated gradient method [27, 28], see also relevant
extensions in EHllSS]. On the other hand, a direct application of optimal SGDs to the aforementioned stochastic
optimization reformulation of (1.1) would yield an

$$\mathcal{O}\Big\{ \sqrt{\tfrac{L}{\mu}} \log \tfrac{1}{\epsilon} + \tfrac{\sigma^2}{\mu\epsilon} \Big\} \tag{1.8}$$

iteration complexity bound on the number of gradient evaluations for $f_i$'s, which was first achieved by the
accelerated stochastic approximation method ([UlIllIS]). Here $\sigma > 0$ denotes the variance of the stochastic gradients.
Clearly, the latter bound is significantly better than the one in (1.7) in terms of its dependence on $m$, but much
worse in terms of its dependence on the accuracy $\epsilon$ and a few other problem parameters (e.g., $L$ and $\mu$).

It should be noted that the optimality of (1.8) for general stochastic programming (1.6) does not preclude
the existence of more efficient algorithms for solving (1.1), because (1.1) is a special case of (1.6) with finite
support $\Xi$. The last few years have seen very active and fruitful research in this field (e.g., [32, 17, 12, 34, 36]). In
particular, Schmidt, Roux and Bach [32] presented a stochastic average gradient (SAG) method, which recursively
computes an estimator of $\nabla f$ by aggregating the gradient of a randomly selected $f_i$ with some other previously
computed gradient information. They proved that the complexity of SAG is bounded by $\mathcal{O}\big((m + L/\mu) \log \tfrac{1}{\epsilon}\big)$; see
also Johnson and Zhang [17] and Defazio et al. [12] for similar complexity results for solving (1.1). In a related
but different line of research, Shalev-Shwartz and Zhang [34] studied a special class of CP problems given in
the form of (1.1) with $f_i(x)$ given by $\phi_i(a_i^T x)$, where $a_i$ denotes an affine mapping. Under the assumption that
$\omega(x) = \|x\|_2^2/2$, they presented an accelerated stochastic dual coordinate ascent (A-SDCA) method, obtained by
properly restarting a stochastic coordinate ascent method in [33] applied to the dual of (1.1). Shalev-Shwartz and
Zhang show that the iteration complexity of this method can be bounded by $\mathcal{O}\big((m + \sqrt{mL/\mu}) \log \tfrac{1}{\epsilon}\big)$. However,
each iteration of A-SDCA requires, instead of the computation of $\nabla f_i$, the solution of a subproblem given in the
form of

$$\arg\min \{ \langle g, y \rangle + \phi_i^*(y) + \|y\|_*^2 \}, \tag{1.9}$$

where $\phi_i^*$ denotes the conjugate function of $\phi_i$. Moreover, these methods were also designed for solving a more
special class of problems than (1.1). More recently, Lin, Lu, and Xiao [23] proposed to apply the accelerated
coordinate descent methods by Nesterov [30], and Fercoq and Richtárik, to obtain similar results for solving
these "regularized empirical loss functions" as in [34]. Zhang and Xiao [36] had also obtained similar results by
using different stochastic primal-dual coordinate decomposition techniques.

¹ Observe that the subgradients of $h$ and $\omega$ are not required due to the assumption in (1.5).

² Suppose that $f_i$ are Lipschitz continuous with constants $M_i$ and let us denote $M := \sum_{i=1}^m M_i$; we should set $\nu_i = M_i/M$ in
order to get the optimal complexity for SGDs.

In this paper, we focus on randomized incremental gradient methods that can access the first-order information
of only one randomly selected smooth component $f_i$ at each iteration (see Bertsekas [5] for an introduction to
incremental gradient methods). It should be noted that while the algorithms in [32, 17, 12] belong to incremental
gradient methods, generally speaking, the dual coordinate algorithms in [23, 34, 36] cannot be considered as
incremental gradient methods because they require the solutions of a different subproblem rather than the
computation of the gradient of $f_i$. The previous attempts to improve the complexity of the existing incremental
gradient methods, e.g., based on the extrapolation idea of Nesterov, however, turned out to be tricky and
unsuccessful; see Section 1.2 of Bertsekas [5] and Section 5 of Agarwal and Bottou [1] for more discussions. Another
important yet unresolved issue is that there does not exist a valid lower complexity bound for randomized
incremental gradient methods in the literature. Hence, it remains unknown what would be the best possible
performance that one can expect for these types of methods. Regarding this question, Agarwal and Bottou [1]
recently suggested a lower complexity bound for solving problems given in the form of (1.1). However, as pointed
out by them in a recent ISMP talk in 2015, the lower complexity bound in [1] is deterministic by construction,
and hence cannot be used to justify the optimality or suboptimality of the randomized incremental gradient
methods in [32, 17, 12] or the dual coordinate methods in [23, 34, 36].

Our contribution in this paper mainly lies in the following aspects. Firstly, we present a new class
of deterministic FOMs, referred to as the primal-dual gradient (PDG) methods, which can achieve the optimal
black-box iteration complexity in (1.7) for solving (1.1). The novelty of these methods lies in: 1) a proper
reformulation of (1.1) as a primal-dual saddle point problem and 2) the incorporation of a new non-differentiable
prox-function (or Bregman distance) based on the conjugate functions of $f$ in the dual space. As a consequence, we
are able to show that the PDG method covers a variant of the well-known Nesterov's accelerated gradient method
as a special case. In particular, the computation of the gradient at the extrapolation point of the accelerated
gradient method is equivalent to a primal prediction step combined with a dual ascent step (employed with
the aforementioned dual prox-function) in the PDG method. While it is often difficult to interpret Nesterov's
method, the development of the PDG method allows us to view this method as a natural iterative buyer-supplier
game. Such a game-theoretic view of the accelerated gradient method seems to be new in the literature. In fact,
the obtained complexity results for the PDG method are slightly stronger than the one in (1.7) and those in
[27, 28] for Nesterov's accelerated gradient method, because a stronger primal-dual termination criterion has been
used in our analysis.

Secondly, we develop a randomized primal-dual gradient (RPDG) method, which is an incremental gradient
method using only one randomly selected component $\nabla f_i$ at each iteration. A variant of PDG, this algorithm
incorporates an additional dual prediction step before performing the primal descent step (with a properly defined
primal prox-function). We prove that the number of iterations (and hence the number of gradients) required by
RPDG is bounded by

$$\mathcal{O}\Big\{ \Big( m + \sqrt{\tfrac{mL}{\mu}} \Big) \log \tfrac{1}{\epsilon} \Big\}, \tag{1.10}$$

both in expectation and with high probability. The complexity bounds of the RPDG method are established in
terms of not only the distance from the iterate to the optimal solution, but also the primal optimality gap
based on the ergodic mean of the iterates. In comparison with the accelerated stochastic dual coordinate ascent
method in [34], RPDG deals with a wider class of problems and can be applied to the cases when $f_i$'s involve
a more complicated composite structure (see examples in [^) and/or a more general regularization term $\omega$ that
is strongly convex with respect to an arbitrary norm (see open problems in Section 7 of ED). Moreover, each
iteration of RPDG only involves the computation of $\nabla f_i$, rather than the more complicated subproblem in (1.9),
which sometimes may not have explicit solutions [34] (e.g., the logistic regression problem). The RPDG method
also admits an interesting game-theoretic interpretation, implying that by properly incorporating randomization,
the buyer and supplier can reach the equilibrium with possibly fewer price changes at the expense of more order
transactions.
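
To make the claimed saving concrete, the following rough sketch compares the order of the deterministic bound (1.7), read here as $m\sqrt{L_f/\mu}\log(1/\epsilon)$, with the order of the RPDG bound (1.10), read here as $(m + \sqrt{mL/\mu})\log(1/\epsilon)$. Constants are dropped, the function names are hypothetical, and the numerical values (including $L_f = L$, the favorable case) are assumptions chosen only for illustration.

```python
import math

def fom_gradient_evals(m, L_f, mu, eps):
    """Order of the deterministic bound: m * sqrt(L_f / mu) * log(1 / eps)."""
    return m * math.sqrt(L_f / mu) * math.log(1.0 / eps)

def rpdg_gradient_evals(m, L, mu, eps):
    """Order of the randomized bound: (m + sqrt(m * L / mu)) * log(1 / eps)."""
    return (m + math.sqrt(m * L / mu)) * math.log(1.0 / eps)

m, mu, eps = 10_000, 1e-4, 1e-6
L = L_f = 1.0   # assumed favorable case: L_f of the same order as L
ratio = fom_gradient_evals(m, L_f, mu, eps) / rpdg_gradient_evals(m, L, mu, eps)
```

With these values the condition number $L/\mu$ equals $m$, so the two terms in the randomized bound balance and the ratio is about $\sqrt{m}/2 = 50$, in line with the $\mathcal{O}(\sqrt{m})$ saving stated in the abstract.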

Thirdly, we show that the number of gradient evaluations required by any randomized incremental gradient
method to find an $\epsilon$-solution of (1.1), i.e., a point $\bar{x} \in X$ s.t. $\mathbb{E}[\|\bar{x} - x^*\|_2^2] \le \epsilon$, cannot be smaller than

$$\Omega\Big\{ \Big( m + \sqrt{\tfrac{mL}{\mu}} \Big) \log \tfrac{1}{\epsilon} \Big\} \tag{1.11}$$

whenever the dimension $n$ is sufficiently large. This bound is obtained by carefully constructing a special class of
separable quadratic programming problems and tightly bounding the expected distance to the optimal solution
for any arbitrary distribution used to choose $f_i$ at each iteration. Comparing (1.10) with (1.11), we conclude that
the complexity of the RPDG method is optimal if $n$ is large enough. To the best of our knowledge, this is the
first time that such a lower complexity bound has been presented for randomized incremental gradient methods
in the literature. As a byproduct, we also derive a lower complexity bound for randomized block coordinate
descent methods by utilizing the separable structure of the aforementioned worst-case instances. These methods
have been intensively studied recently, but a valid lower complexity bound is still missing in the literature.

Finally, we generalize RPDG for problems which are not necessarily strongly convex (i.e., $\mu = 0$) and/or involve
structured nonsmooth terms $f_i$. We show that for all these cases, RPDG can save $\mathcal{O}(\sqrt{m})$ times gradient
computations (up to certain logarithmic factors) in comparison with the corresponding optimal deterministic
FOMs. In particular, we show that when both the primal and dual of (1.1) are not strongly convex, the total
number of iterations performed by the RPDG method can be bounded by $\mathcal{O}(\sqrt{m}/\epsilon)$ (up to some logarithmic
factors), which is $\mathcal{O}(\sqrt{m})$ times better, in terms of the total number of dual subproblems to be solved, than
Nesterov's smoothing technique [29], Nemirovski's mirror-prox method [24], or Chambolle and Pock's primal-dual
method [8]. It seems that this complexity result has not been obtained before in the literature.

It is worth mentioning a few works relevant to our development. The two most related ones were conducted
independently by Dang and Lan, and Zhang and Xiao [36]. Both of these papers deal with randomized
variants of the primal-dual method presented by Chambolle and Pock [8] (see also extensions in nni) for solving
saddle point problems. Zhang and Xiao's development [36] was based on a variant of the primal-dual method
for solving strongly convex saddle point problems [8]. They were able to show that a block-wise randomized
version of the algorithm can achieve a complexity similar to that of the A-SDCA method in [34]. Since Zhang and Xiao's
algorithm targets a similar class of problems and requires the solutions of a similar subproblem to
[34], it appears that the aforementioned possible advantages of RPDG over A-SDCA are also applicable to the
stochastic primal-dual coordinate method in [36]. Moreover, the complexity bound of Zhang and Xiao's algorithm
is only established in terms of the Euclidean distance of the iterate to the optimal solution. They did not
deal with the convergence of the ergodic mean of iterates. On the other hand, Dang and Lan's work was motivated
by the observation in [^ that a direct extension of the alternating direction method of multipliers (ADMM) does
not converge for multi-block problems. Their work then focuses on the non-strongly convex case and
shows that a randomized primal-dual method, which is equivalent to a randomized pre-conditioned ADMM for
linear constrained problems, does converge for multi-block problems. Without incorporating the aforementioned
dual prediction step, the complexity obtained by Dang and Lan is $\mathcal{O}(\sqrt{m})$ times worse than Chambolle and Pock's method.
Nevertheless, this is the first time that randomized algorithms for saddle point optimization with an $\mathcal{O}(1/t)$
complexity have been presented in the literature. More recently, close to the end of the preparation of this paper,
we noticed that Lin, Mairal, and Harchaoui [22] in a concurrent work presented a catalyst scheme that can be
used to accelerate the SAG method in [32] and thus possibly achieve the complexity bound in (1.10) (under the
Euclidean setting). While their approach is an indirect one obtained by properly restarting SAG (or other
"non-accelerated" first-order methods), the proposed randomized primal-dual gradient method is a direct approach
with a "built-in" acceleration. Also, none of these works discussed the lower complexity bound for
randomized methods.

This paper is organized as follows. We first study the deterministic primal-dual method in Section 2. Section 3
is devoted to the design and analysis of the randomized primal-dual method for the strongly convex case, as well
as the development of the lower complexity bound in (1.11). In Section 4, we generalize the RPDG method to
different classes of CP problems that are not necessarily strongly convex. Important technical results and proofs
of the main theorems in Sections 2 and 3 are provided in Section 5. Some brief concluding remarks are made in
Section 6.

Notation and terminology. We use $\|\cdot\|$ to denote an arbitrary norm in $\mathbb{R}^n$, which is not necessarily associated
with the inner product $\langle \cdot, \cdot \rangle$. We also use $\|\cdot\|_*$ to denote the conjugate norm of $\|\cdot\|$. For any convex function $h$,
$\partial h(x)$ is the subdifferential of $h$ at $x$. Given any $X \subseteq \mathbb{R}^n$, we say a convex function $h : X \to \mathbb{R}$ is nonsmooth if
$|h(x) - h(y)| \le M_h \|x - y\|$ for any $x, y \in X$. We say that a convex function $f : X \to \mathbb{R}$ is smooth if it is Lipschitz
continuously differentiable with Lipschitz constant $L > 0$, i.e., $\|\nabla f(y) - \nabla f(x)\|_* \le L \|y - x\|$ for any $x, y \in X$. For
any $p \ge 1$, $\|\cdot\|_p$ denotes the standard $p$-norm in $\mathbb{R}^n$, i.e.,

$$\|x\|_p^p = \sum_{i=1}^n |x_i|^p \quad \text{for any } x \in \mathbb{R}^n.$$

For any real number $r$, $\lceil r \rceil$ and $\lfloor r \rfloor$ denote the nearest integer to $r$ from above and below, respectively. $\mathbb{R}_+$ and
$\mathbb{R}_{++}$, respectively, denote the set of nonnegative and positive real numbers. $\mathbb{N}$ denotes the set of natural numbers
$\{1, 2, \ldots\}$.


2 An optimal primal-dual gradient method 

Our goal in this section is to present a novel primal-dual gradient (PDG) method for solving (1.1), which will
also provide a basis for the development of the randomized primal-dual gradient methods in later sections.
We establish the optimal convergence of this algorithm in terms of the primal-dual optimality gap under the
assumption that the gradient of $f$ is computed at each iteration. We show that PDG generalizes one variant
of the well-known Nesterov's accelerated gradient method and allows a natural game interpretation, and hence
that the latter algorithm also admits a similar interpretation.


2.1 Preliminaries: primal and dual prox-functions 

In this subsection, we discuss both primal and dual prox-functions (proximity control functions) in the primal
and dual spaces, respectively.

Recall that the function $\omega : X \to \mathbb{R}$ in (1.1) is strongly convex with modulus 1 with respect to $\|\cdot\|$. We can
define a primal prox-function associated with $\omega$ as

$$P(x^0, x) \equiv P_\omega(x^0, x) := \omega(x) - [\omega(x^0) + \langle \omega'(x^0), x - x^0 \rangle], \tag{2.1}$$

where $\omega'(x^0) \in \partial\omega(x^0)$ is an arbitrary subgradient of $\omega$ at $x^0$. Clearly, by the strong convexity of $\omega$, we have

$$P(x^0, x) \ge \tfrac{1}{2}\|x - x^0\|^2, \quad \forall x, x^0 \in X. \tag{2.2}$$

Note that the prox-function $P(\cdot, \cdot)$ described above generalizes the Bregman's distance in the sense that $\omega$ is
not necessarily differentiable (see | 6 l[ 2 ll^fT 8 ] and references therein). Throughout this paper, we assume that the
prox-mapping associated with $X$, $\omega$, and $h$, given by

$$M_X(g, x^0, \eta) \equiv M_{X,\omega,h}(g, x^0, \eta) := \arg\min_{x \in X} \{ \langle g, x \rangle + h(x) + \mu\,\omega(x) + \eta P(x^0, x) \}, \tag{2.3}$$

is easily computable for any $x^0 \in X$, $g \in \mathbb{R}^n$, $\mu \ge 0$, and $\eta > 0$. Clearly this is equivalent to the assumption
that (1.5) is easy to solve. Whenever $\omega$ is non-differentiable, we need to specify a particular selection of the
subgradient $\omega'$ before performing the prox-mapping. We assume throughout this paper that such a selection of
$\omega'$ is defined recursively as follows. Denote $x^1 \equiv M_X(g, x^0, \eta)$. By the optimality condition of (2.3), we have

$$g + h'(x^1) + (\mu + \eta)\,\omega'(x^1) - \eta\,\omega'(x^0) \in N_X(x^1),$$

where $N_X(x^1)$ denotes the normal cone of $X$ at $x^1$. Once such an $\omega'(x^1)$ satisfying the above relation is identified, we
will use it as a subgradient when defining $P(x^1, x)$ in the next iteration.
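
As a concrete instance of an easily computable prox-mapping, the sketch below evaluates the prox-mapping in closed form under assumptions that are ours, not the paper's: $X = \mathbb{R}^n$, $h(x) = \lambda\|x\|_1$, and the Euclidean choice $\omega(x) = \|x\|_2^2/2$, so that $P(x^0, x) = \|x - x^0\|_2^2/2$. In that case the minimizer reduces to a soft-thresholding step. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Componentwise shrinkage: the minimizer of 0.5*(x - v)^2 + kappa*|x|."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def prox_mapping(g, x0, eta, mu, lam):
    """Prox-mapping for X = R^n, h = lam*||.||_1, omega = 0.5*||.||_2^2 (assumed setup)."""
    # Completing the square in <g,x> + 0.5*mu*||x||^2 + 0.5*eta*||x - x0||^2
    # gives a quadratic centered at (eta*x0 - g)/(mu + eta) with curvature mu + eta.
    return soft_threshold((eta * x0 - g) / (mu + eta), lam / (mu + eta))

def prox_objective(x, g, x0, eta, mu, lam):
    """Objective minimized by the prox-mapping in this setup."""
    return (g @ x + lam * np.abs(x).sum()
            + 0.5 * mu * (x @ x) + 0.5 * eta * ((x - x0) @ (x - x0)))

rng = np.random.default_rng(0)
g, x0 = rng.standard_normal(6), rng.standard_normal(6)
x1 = prox_mapping(g, x0, eta=2.0, mu=0.5, lam=0.3)
```

For other choices of $\omega$ (for example an entropy function on the simplex) the closed form changes, but the structure of the subproblem stays the same.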

Now let us consider the dual space $\mathcal{G}$, where the gradients of $f$ reside, and equip it with the conjugate norm
$\|\cdot\|_*$. Let $J_f : \mathcal{G} \to \mathbb{R}$ be the conjugate function of $f$ such that

$$f(x) := \max_{g \in \mathcal{G}} \langle x, g \rangle - J_f(g). \tag{2.4}$$

It is clear that $J_f$ is strongly convex with modulus $1/L_f$ w.r.t. $\|\cdot\|_*$. Therefore, we can define its associated dual
prox-functions and dual prox-mappings as

$$D_f(g^0, g) := J_f(g) - [J_f(g^0) + \langle J_f'(g^0), g - g^0 \rangle], \tag{2.5}$$

$$M_{\mathcal{G}}(-\tilde{x}, g^0, \tau) := \arg\min_{g \in \mathcal{G}} \{ \langle -\tilde{x}, g \rangle + J_f(g) + \tau D_f(g^0, g) \}, \tag{2.6}$$

for any $g^0, g \in \mathcal{G}$. Again, $D_f$ may not be uniquely defined since $J_f$ is not necessarily differentiable. Instead of
choosing $J_f' \in \partial J_f$ similarly to $\omega'$, we can explicitly specify such selections as will be discussed later in this paper.

The following simple result shows that the computation of the dual prox-mapping associated with $D_f$ is
equivalent to the computation of $\nabla f$.

Lemma 1 Let $\tilde{x} \in X$ and $g^0 \in \mathcal{G}$ be given and $D_f(g^0, g)$ be defined in (2.5). For any $\tau > 0$, let us denote
$z = [\tilde{x} + \tau J_f'(g^0)]/(1 + \tau)$. Then we have $\nabla f(z) = M_{\mathcal{G}}(-\tilde{x}, g^0, \tau)$.

Proof. In view of the definition of $D_f$ in (2.5), we have

$$M_{\mathcal{G}}(-\tilde{x}, g^0, \tau) = \arg\min_{g \in \mathcal{G}} \{ -\langle \tilde{x} + \tau J_f'(g^0), g \rangle + (1 + \tau) J_f(g) \} = \arg\max_{g \in \mathcal{G}} \{ \langle z, g \rangle - J_f(g) \} = \nabla f(z).$$
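
Lemma 1 can be checked numerically on a simple instance. The sketch below assumes a strongly convex quadratic $f(x) = \tfrac{1}{2}x^T A x$ (our choice, not the paper's), for which $\nabla f(x) = Ax$, $J_f(g) = \tfrac{1}{2}g^T A^{-1} g$ and $J_f'(g) = A^{-1}g$. It solves the first-order optimality condition of the dual prox-mapping directly and compares the result with the gradient of $f$ at the point $z$ defined in the lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 4, 0.7
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)      # f(x) = 0.5 x^T A x, so grad f(x) = A x
A_inv = np.linalg.inv(A)     # J_f(g) = 0.5 g^T A^{-1} g, so J_f'(g) = A^{-1} g

x_tilde = rng.standard_normal(n)
g0 = rng.standard_normal(n)

# Dual prox-mapping: minimize -<x_tilde, g> + J_f(g) + tau * D_f(g0, g).
# Setting the gradient in g to zero gives (1 + tau) A^{-1} g = x_tilde + tau A^{-1} g0.
g_prox = np.linalg.solve((1.0 + tau) * A_inv, x_tilde + tau * (A_inv @ g0))

# Lemma 1: the same point is the gradient of f at z = (x_tilde + tau J_f'(g0)) / (1 + tau).
z = (x_tilde + tau * (A_inv @ g0)) / (1.0 + tau)
g_grad = A @ z
```

In this instance both routes give the same vector up to rounding, which is why the dual step of the method can be implemented with a single gradient evaluation.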


2.2 Primal-dual gradient method, Nesterov's method, and a game interpretation

By the definition of $J_f$ in (2.4), problem (1.1) is equivalent to:

$$\Psi^* := \min_{x \in X} \Big\{ h(x) + \mu\,\omega(x) + \max_{g \in \mathcal{G}} \langle x, g \rangle - J_f(g) \Big\}. \tag{2.7}$$

The primal-dual gradient method in Algorithm 1 can be viewed as a game iteratively performed by a primal
player (buyer) and a dual player (supplier) for finding the optimal solution (order quantity and product price)
of the saddle point problem in (2.7). In this game, both the buyer and supplier have access to their local cost
$h(x) + \mu\,\omega(x)$ and $J_f(g)$, respectively, as well as their interactive cost (or revenue) represented by a bilinear
function $\langle x, g \rangle$. Our goal is to design an algorithm such that the buyer and supplier can achieve an equilibrium as
soon as possible. In the proposed algorithm, the supplier first applies (2.8) to predict the demand $\tilde{x}^t$ based on
historical information, i.e., $x^{t-1}$ and $x^{t-2}$. She then determines in (2.9) the price $g^t$ in a way to maximize the
predicted profit $\langle \tilde{x}^t, g \rangle - J_f(g)$, regularized by the dual prox-function $D_f(g^{t-1}, g)$ with a certain weight $\tau_t \ge 0$.
Once the supplier has made her decision, the buyer then determines his action $x^t$ according to (2.10) in order
to minimize the cost $h(x) + \mu\,\omega(x) + \langle x, g^t \rangle$, regularized by the primal prox-function $P(x^{t-1}, x)$ with a certain
weight $\eta_t \ge 0$.

Algorithm 1 The primal-dual gradient method

Let $x^0 = x^{-1} \in X$, and the nonnegative parameters $\{\tau_t\}$, $\{\eta_t\}$, and $\{\alpha_t\}$ be given.
Set $g^0 = \nabla f(x^0)$.
for $t = 1, \ldots, k$ do
    Update $(x^t, g^t)$ according to

    $$\tilde{x}^t = \alpha_t (x^{t-1} - x^{t-2}) + x^{t-1}. \tag{2.8}$$
    $$g^t = M_{\mathcal{G}}(-\tilde{x}^t, g^{t-1}, \tau_t). \tag{2.9}$$
    $$x^t = M_X(g^t, x^{t-1}, \eta_t). \tag{2.10}$$

end for

In order to implement the above primal-dual gradient method, it is more convenient to rewrite step (2.9)
in a form involving the computation of the gradient rather than the dual prox-mapping $M_{\mathcal{G}}$. In order to do so, we
shall specify explicitly the selection of the subgradient $J_f'$ in (2.9). Denoting $\underline{x}^0 = x^0$, we can easily see from
$g^0 = \nabla f(\underline{x}^0)$ that $\underline{x}^0 \in \partial J_f(g^0)$. Using this relation and letting $J_f'(g^{t-1}) = \underline{x}^{t-1}$ in $D_f(g^{t-1}, g)$ (see (2.5)), we
then conclude from Lemma 1 that for any $t \ge 1$, (2.9) reduces to

$$\underline{x}^t = (\tilde{x}^t + \tau_t \underline{x}^{t-1})/(1 + \tau_t) \quad \text{and} \quad g^t = \nabla f(\underline{x}^t).$$

With the above selection of the dual prox-function, we can specialize the primal-dual gradient method as follows.

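Algorithm 2 A particular implementation of the primal-dual gradient method

Input: Let $x^0 = x^{-1} \in X$, and the nonnegative parameters $\{\tau_t\}$, $\{\eta_t\}$, and $\{\alpha_t\}$ be given.
Set $\underline{x}^0 = x^0$.
for $t = 1, 2, \ldots, k$ do

    $$\tilde{x}^t = \alpha_t (x^{t-1} - x^{t-2}) + x^{t-1}. \tag{2.11}$$
    $$\underline{x}^t = (\tilde{x}^t + \tau_t \underline{x}^{t-1})/(1 + \tau_t). \tag{2.12}$$
    $$g^t = \nabla f(\underline{x}^t). \tag{2.13}$$
    $$x^t = M_X(g^t, x^{t-1}, \eta_t). \tag{2.14}$$

end for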
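
A minimal sketch of Algorithm 2 is given below under simplifying assumptions that are ours: $X = \mathbb{R}^n$, $h = 0$, $\omega(x) = \|x\|_2^2/2$, a quadratic $f$, and the constant parameters $\tau = \sqrt{2L_f/\mu}$, $\eta = \sqrt{2L_f\mu}$, $\alpha = \tau/(1+\tau)$, which is our reading of a constant stepsize policy for the case $\mu > 0$. The function name `pdg` and the test problem are hypothetical.

```python
import numpy as np

def pdg(grad_f, x0, L_f, mu, k):
    """Algorithm 2 for X = R^n, h = 0, omega = 0.5*||.||_2^2 (assumed setup)."""
    tau = np.sqrt(2.0 * L_f / mu)
    eta = np.sqrt(2.0 * L_f * mu)
    alpha = tau / (1.0 + tau)
    x_prev, x, x_under = x0.copy(), x0.copy(), x0.copy()
    for _ in range(k):
        x_tilde = alpha * (x - x_prev) + x                  # (2.11) primal prediction
        x_under = (x_tilde + tau * x_under) / (1.0 + tau)   # (2.12)
        g = grad_f(x_under)                                 # (2.13) dual step as a gradient
        # (2.14): argmin_x <g, x> + 0.5*mu*||x||^2 + 0.5*eta*||x - x^{t-1}||^2
        x_prev, x = x, (eta * x - g) / (mu + eta)
    return x

rng = np.random.default_rng(2)
n, mu = 20, 1.0
B = rng.standard_normal((n, n))
A = B @ B.T / n                          # f(x) = 0.5 x^T A x - b^T x
b = rng.standard_normal(n)
L_f = np.linalg.eigvalsh(A).max()
grad_f = lambda x: A @ x - b
x_star = np.linalg.solve(A + mu * np.eye(n), b)   # minimizer of f + 0.5*mu*||x||^2
x_k = pdg(grad_f, np.zeros(n), L_f, mu, k=300)
```

Note that the gradient is evaluated at $\underline{x}^t$ rather than at $x^{t-1}$, which is the only place where the scheme differs from a plain proximal gradient step.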


Observe that one potential problem associated with this scheme is that the search points $\tilde{x}^t$ and $\underline{x}^t$ defined in (2.11)
and (2.12), respectively, may fall outside $X$. As a result, we need to assume $f$ to be differentiable over $\mathbb{R}^n$.
However, it can be shown that by properly specifying $\alpha_t$ and $\tau_t$, we can guarantee $\underline{x}^t \in X$ and thus relax such
restrictions on the differentiability of $f$ (see (2.31) and (2.32) below).

The above PDG method is related to the well-known Nesterov's accelerated gradient (AG) method. Let us
focus on a simple variant of the AG method that has been extensively studied in the literature (e.g., [28, 35, 19,
14, 15, 16]). Given $(x^{t-1}, \bar{x}^{t-1}) \in X \times X$, this AG algorithm updates $(x^t, \bar{x}^t)$ by

$$\underline{x}^t = (1 - \lambda_t)\bar{x}^{t-1} + \lambda_t x^{t-1}, \tag{2.15}$$
$$x^t = M_X(\nabla f(\underline{x}^t), x^{t-1}, \eta_t), \tag{2.16}$$
$$\bar{x}^t = (1 - \lambda_t)\bar{x}^{t-1} + \lambda_t x^t, \tag{2.17}$$

for some $\lambda_t \in [0, 1]$. By (2.15) and (2.17), we have

$$\begin{aligned}
\underline{x}^t &= (1 - \lambda_t)[(1 - \lambda_{t-1})\bar{x}^{t-2} + \lambda_{t-1} x^{t-1}] + \lambda_t x^{t-1} \\
&= (1 - \lambda_t)[\underline{x}^{t-1} - \lambda_{t-1} x^{t-2} + \lambda_{t-1} x^{t-1}] + \lambda_t x^{t-1} \\
&= (1 - \lambda_t)\underline{x}^{t-1} + (1 - \lambda_t)\lambda_{t-1}(x^{t-1} - x^{t-2}) + \lambda_t x^{t-1}.
\end{aligned}$$

Therefore, (2.15) is equivalent to (2.11) and (2.12) with $\tau_t = (1 - \lambda_t)/\lambda_t$ and $\alpha_t = \lambda_{t-1}(1 - \lambda_t)/\lambda_t$. Moreover,
(2.16) is identical to (2.14) (and (2.10)), and (2.17) basically defines the output of the AG algorithm as an ergodic
mean of the iterates $x^t$. We then conclude that the above variant of Nesterov's AG method is a special case of
Algorithm 2 (and Algorithm 1). It should be noted, however, that Algorithm 1 provides more flexibility in the
specification of parameters, which will be used later in the development of the RPDG method. Moreover, the
presentation of the PDG method helps us to reveal a natural game interpretation out of the intertwined and
somewhat mysterious updating of the three search sequences in the AG method.
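
The equivalence between the AG variant and Algorithm 2 can be checked numerically. The sketch below runs both recursions with $\tau_t = (1-\lambda_t)/\lambda_t$ and $\alpha_t = \lambda_{t-1}(1-\lambda_t)/\lambda_t$ and compares the iterates $x^t$. The setup ($X = \mathbb{R}^n$, $h = 0$, $\mu = 0$, $\omega = \|\cdot\|_2^2/2$, a quadratic $f$, $\lambda_t = 2/(t+1)$ and $\eta_t = 4L_f/t$) and all function names are assumptions made for illustration.

```python
import numpy as np

def prox_step(g, x_prev, eta):
    # Prox-mapping for X = R^n, h = 0, mu = 0, omega = 0.5*||.||_2^2 (assumed setup)
    return x_prev - g / eta

def ag_variant(grad_f, x0, lam, eta, k):
    x, x_bar, iterates = x0.copy(), x0.copy(), []
    for t in range(1, k + 1):
        x_under = (1 - lam[t]) * x_bar + lam[t] * x      # (2.15)
        x = prox_step(grad_f(x_under), x, eta[t])        # (2.16)
        x_bar = (1 - lam[t]) * x_bar + lam[t] * x        # (2.17)
        iterates.append(x)
    return np.array(iterates)

def algorithm_2(grad_f, x0, lam, eta, k):
    x_prev, x, x_under, iterates = x0.copy(), x0.copy(), x0.copy(), []
    for t in range(1, k + 1):
        tau = (1 - lam[t]) / lam[t]
        alpha = lam[t - 1] * (1 - lam[t]) / lam[t]
        x_tilde = alpha * (x - x_prev) + x                # (2.11)
        x_under = (x_tilde + tau * x_under) / (1 + tau)   # (2.12)
        g = grad_f(x_under)                               # (2.13)
        x_prev, x = x, prox_step(g, x, eta[t])            # (2.14)
        iterates.append(x)
    return np.array(iterates)

rng = np.random.default_rng(5)
n, k = 6, 30
B = rng.standard_normal((n, n))
A, b = B @ B.T / n, rng.standard_normal(n)
L_f = np.linalg.eigvalsh(A).max()
t_idx = np.arange(k + 1)
lam = 2.0 / (t_idx + 1.0)                  # lam[1] = 1, so tau_1 = alpha_1 = 0
eta = 4.0 * L_f / np.maximum(t_idx, 1)     # eta[0] is unused
grad_f = lambda x: A @ x - b
xs_ag = ag_variant(grad_f, np.ones(n), lam, eta, k)
xs_pdg = algorithm_2(grad_f, np.ones(n), lam, eta, k)
```

The two runs differ only in bookkeeping: the AG variant carries the ergodic mean $\bar{x}^t$, while Algorithm 2 carries the previous iterate and the previous gradient point.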

Algorithm 1 is also closely related to Chambolle and Pock's primal-dual method for solving saddle point
problems [8], which explains the origin of its name. Two versions of primal-dual methods were discussed in [8].
One is designed for solving general saddle point problems without assuming the strong convexity of $J_f$ and the
other one is to deal with the case when $J_f$ is strongly convex by incorporating an additional extrapolation step.
As pointed out in Remark 3 of [8], the rate of convergence for the latter primal-dual method is only suboptimal
for solving (1.1) as it uses a weaker termination criterion. On the other hand, the PDG method does not involve
any additional extrapolation steps and so it shares a similar scheme to the basic version of the primal-dual
method in [8]. Moreover, the original primal-dual methods in [8] do not employ general prox-functions, which,
as shown in Lemma 1, is crucial to relate the dual step (2.9) to the computation of the gradients. It should
be noted that some recent extensions of the primal-dual method in |101llll F7] indeed consider the incorporation
of prox-functions, but restricted to problems without strong convexity. Hence, none of these earlier primal-dual
methods can be viewed as a generalized accelerated gradient method.


2.3 Convergence properties of the primal-dual gradient method 


Our goal in this subsection is to show that Algorithm 1 exhibits an optimal rate of convergence for solving
problem (1.1). It is worth mentioning that our analysis significantly differs from the previous studies on optimal
gradient methods and those on primal-dual methods for saddle point problems.

Given a pair of feasible solutions $\bar{z} = (\bar{x}, \bar{g})$ and $z = (x, g)$ of (2.7), we define the primal-dual gap function
$Q_f(\bar{z}, z)$ by

$$Q_f(\bar{z}, z) := [h(\bar{x}) + \mu\,\omega(\bar{x}) + \langle \bar{x}, g \rangle - J_f(g)] - [h(x) + \mu\,\omega(x) + \langle x, \bar{g} \rangle - J_f(\bar{g})]. \tag{2.18}$$

It can be easily seen that $\bar{z}$ (resp., $z$) is an optimal solution of (2.7) if and only if $Q_f(\bar{z}, z) \le 0$ for any $z \in X \times \mathcal{G}$
(resp., $Q_f(\bar{z}, z) \ge 0$ for any $\bar{z} \in X \times \mathcal{G}$). Therefore, one can assess the solution quality of $\bar{z}$ by the primal-dual
optimality gap:

$$\mathrm{gap}(\bar{z}) := \max_{z \in X \times \mathcal{G}} Q_f(\bar{z}, z). \tag{2.19}$$

It should be noted that $\mathrm{gap}(\bar{z})$ may not be well-defined, for example, when $X$ is unbounded and $h$ is not strictly
convex. In these cases, we can define a slightly modified primal-dual gap

$$\mathrm{gap}^*(\bar{z}) := \max \{ Q_f(\bar{z}, z) : x = x^*, \; g \in \mathcal{G} \} \tag{2.20}$$

for an arbitrary optimal solution $x^*$ of (1.1). Since $J_f$ is strongly convex, $\mathrm{gap}^*$ is well-defined.

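The following result establishes some relationship between the primal optimality gap $\Psi(\bar{x}) - \Psi^*$ and the above
primal-dual optimality gaps.

Lemma 2 Let $\bar{z} = (\bar{x}, \bar{g}) \in X \times \mathcal{G}$ be a given pair of feasible solutions of (2.7) and denote $\hat{g} = \nabla f(\bar{x})$. Also let
$z^* = (x^*, g^*)$ be a pair of optimal solutions of (2.7). Then we have

$$\Psi(\bar{x}) - \Psi(x^*) = Q_f((\bar{x}, g^*), (x^*, \hat{g})) \le \mathrm{gap}^*(\bar{z}). \tag{2.21}$$

If, in addition, $X$ is bounded, then

$$\mathrm{gap}^*(\bar{z}) \le \mathrm{gap}(\bar{z}). \tag{2.22}$$

Proof. It follows from the definitions of $\hat{g}$, $\mathrm{gap}^*$ and the gap function $Q_f$ that

$$\begin{aligned}
\Psi(\bar{x}) - \Psi(x^*) &= Q_f((\bar{x}, g^*), (x^*, \hat{g})) \\
&= \big[h(\bar{x}) + \mu\,\omega(\bar{x}) + \max_{g \in \mathcal{G}} \langle \bar{x}, g \rangle - J_f(g)\big] - [h(x^*) + \mu\,\omega(x^*) + \langle x^*, g^* \rangle - J_f(g^*)] \\
&\le \big[h(\bar{x}) + \mu\,\omega(\bar{x}) + \max_{g \in \mathcal{G}} \langle \bar{x}, g \rangle - J_f(g)\big] - [h(x^*) + \mu\,\omega(x^*) + \langle x^*, \bar{g} \rangle - J_f(\bar{g})] \\
&= \mathrm{gap}^*(\bar{z}).
\end{aligned}$$

Relation (2.22) follows directly from the definitions of $\mathrm{gap}^*$ and $\mathrm{gap}$. ■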

Theorem 1 below describes the main convergence properties of the PDG method. More specifically, we provide
in Theorem 1.a) a constant stepsize policy which works for the strongly convex case where $\mu > 0$, and a different
parameter setting that works for the non-strongly convex case with $\mu = 0$ in Theorem 1.b). Note that for the
strongly convex case, we estimate the solution quality for the iterates $x^t$, $t = 1, \ldots, k$, as well as that for their
ergodic mean

$$\bar{x}^k := \Big( \sum_{t=1}^k \theta_t \Big)^{-1} \sum_{t=1}^k \theta_t x^t \tag{2.23}$$

for some $\theta_t \ge 0$, while only establishing the error bounds for $\bar{x}^k$ for the non-strongly convex case. We put the
proof of Theorem 1 in Section 5 since it shares many basic elements with the convergence analysis of the RPDG
method.

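Theorem 1 Let $x^*$ be an optimal solution of (1.1), and let $x^k$ and $\bar{x}^k$ be defined in (2.10) and (2.23), respectively.

a) Suppose that $\mu > 0$ and that $\{\tau_t\}$, $\{\eta_t\}$, $\{\alpha_t\}$ and $\{\theta_t\}$ are set to

$$\tau_t = \sqrt{\tfrac{2L_f}{\mu}}, \quad \eta_t = \sqrt{2L_f\mu}, \quad \alpha_t = \alpha \equiv \frac{\sqrt{2L_f/\mu}}{1 + \sqrt{2L_f/\mu}}, \quad \text{and} \quad \theta_t = \frac{1}{\alpha^t}, \quad \forall t = 1, \ldots, k. \tag{2.24}$$

Then,

$$P(x^k, x^*) \le \tfrac{\mu + L_f}{\mu}\,\alpha^k P(x^0, x^*), \tag{2.25}$$

$$\Psi(\bar{x}^k) - \Psi(x^*) \le \mathrm{gap}^*(\bar{z}^k) \le \mu(1 - \alpha)^{-1}\Big[1 + \tfrac{L_f}{\mu}\Big(2 + \tfrac{L_f}{\mu}\Big)\Big]\alpha^k P(x^0, x^*), \tag{2.26}$$

$$\Psi(\bar{x}^k) - \Psi(x^*) \le \mathrm{gap}(\bar{z}^k) \le \mu(1 - \alpha)^{-1}\Big[1 + \tfrac{L_f}{\mu}\Big(2 + \tfrac{L_f}{\mu}\Big)\Big]\alpha^k \max_{x \in X} P(x^0, x). \tag{2.27}$$

b) Suppose that $\{\tau_t\}$, $\{\eta_t\}$, $\{\alpha_t\}$ and $\{\theta_t\}$ are set to

$$\tau_t = \tfrac{t-1}{2}, \quad \eta_t = \tfrac{4L_f}{t}, \quad \alpha_t = \tfrac{t-1}{t}, \quad \text{and} \quad \theta_t = t, \quad \forall t = 1, \ldots, k. \tag{2.28}$$

Then,

$$\Psi(\bar{x}^k) - \Psi(x^*) \le \mathrm{gap}^*(\bar{z}^k) \le \tfrac{8L_f}{k(k+1)} P(x^0, x^*), \tag{2.29}$$

$$\Psi(\bar{x}^k) - \Psi(x^*) \le \mathrm{gap}(\bar{z}^k) \le \tfrac{8L_f}{k(k+1)} \max_{x \in X} P(x^0, x). \tag{2.30}$$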

Observe that when the algorithmic parameters are set to (2.24), by using an inductive argument, we can
easily show that

$$\underline{x}^t = (1 - \alpha^2) x^{t-1} + (1 - \alpha) \sum_{i=1}^{t-2} \alpha^{t-i} x^i + \alpha^t x^0. \tag{2.31}$$

In other words, $\underline{x}^t$ can be written as a convex combination of $x^0, \ldots, x^{t-1}$ and hence $\underline{x}^t \in X$ for any $t \ge 1$.
Similarly, when the algorithmic parameters are set to (2.28), we can show by using induction that

$$\underline{x}^k = \frac{2(2k-1)}{k(k+1)}\, x^{k-1} + \frac{2}{k(k+1)} \sum_{i=1}^{k-2} i\, x^i, \tag{2.32}$$

which implies $\underline{x}^k \in X$. Therefore, we only need to assume the differentiability of $f$ over $X$ rather than the whole
$\mathbb{R}^n$.
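
The convex-combination representation under constant parameters can be verified on arbitrary data. The sketch below assumes constant $\tau$ and $\alpha = \tau/(1+\tau)$ (the relation satisfied by the constant policy as we read it), runs the recursion (2.11)-(2.12) on random stand-ins for the iterates, and compares the result with the closed form $(1-\alpha^2)x^{t-1} + (1-\alpha)\sum_{i=1}^{t-2}\alpha^{t-i}x^i + \alpha^t x^0$ for $t \ge 2$. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
tau, k, n = 3.0, 8, 2
alpha = tau / (1.0 + tau)
xs = rng.standard_normal((k, n))   # stand-ins for x^0, ..., x^{k-1}

def x_under_recursive(t):
    """Run (2.11)-(2.12) up to iteration t with x^{-1} = x^0 and x_under^0 = x^0."""
    x_under = xs[0]
    for s in range(1, t + 1):
        x_prev2 = xs[s - 2] if s >= 2 else xs[0]
        x_tilde = alpha * (xs[s - 1] - x_prev2) + xs[s - 1]
        x_under = (x_tilde + tau * x_under) / (1.0 + tau)
    return x_under

def x_under_closed_form(t):
    """Closed-form convex combination, checked here for t >= 2."""
    out = (1 - alpha**2) * xs[t - 1] + alpha**t * xs[0]
    for i in range(1, t - 1):
        out = out + (1 - alpha) * alpha**(t - i) * xs[i]
    return out

def weights_sum(t):
    """Sum of the combination weights; equals 1 for a convex combination."""
    return ((1 - alpha**2) + alpha**t
            + sum((1 - alpha) * alpha**(t - i) for i in range(1, t - 1)))
```

Since all weights are nonnegative and sum to one, the gradient point stays in $X$ whenever the iterates do; for $t = 1$ the recursion simply returns $x^0$.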

In view of the results obtained in Theorem 1, the primal-dual gradient method is an optimal method for
convex optimization. In fact, the rates of convergence in (2.26), (2.27), (2.29) and (2.30) associated with the
ergodic mean have employed the primal-dual optimality gaps $\mathrm{gap}^*(\bar{z}^k)$ and $\mathrm{gap}(\bar{z}^k)$, which are stronger than the
primal optimality gap $\Psi(\bar{x}^k) - \Psi(x^*)$ used in the previous studies for accelerated gradient methods. Moreover,
whenever $X$ is bounded, the primal-dual optimality gap $\mathrm{gap}(\bar{z}^k)$ gives us a computable online accuracy certificate
to check the quality of the solution $\bar{z}^k$ (see the cited literature for some related discussions). Also observe that each iteration
of the PDG method requires the computation of $\nabla f$, and hence of all the $m$ components $\nabla f_i$. In the next section,
we will develop a randomized PDG method that can possibly reduce the number of gradient evaluations for $\nabla f_i$
by utilizing the finite-sum structure of problem (1.1).

3 Randomized primal-dual gradient methods

In this section, we present a randomized primal-dual gradient (RPDG) method which needs to compute the 
gradient of only one randomly selected component function fi at each iteration. We show that RPDG can 
possibly achieve a better complexity than PDG in terms of the total number of gradient evaluations. 


3.1 Multi-dual-player reformulation and the RPDG algorithm 

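We start by introducing a different saddle point reformulation of (1.1) than (2.7). Let $J_i : \mathcal{Y}_i \to \mathbb{R}$ be the
conjugate functions of $f_i$ and $\mathcal{Y}_i$, $i = 1, \dots, m$, denote the dual spaces where the gradients of $f_i$ reside. For the
sake of notational convenience, let us denote $J(y) := \sum_{i=1}^m J_i(y_i)$, $\mathcal{Y} := \mathcal{Y}_1 \times \mathcal{Y}_2 \times \cdots \times \mathcal{Y}_m$, and $y = (y_1; y_2; \dots; y_m)$
for any $y_i \in \mathcal{Y}_i$, $i = 1, \dots, m$. Clearly, we can reformulate problem (1.1) equivalently as a saddle point problem:

$$\Psi^* := \min_{x \in X} \left\{ h(x) + \mu\omega(x) + \max_{y \in \mathcal{Y}} \langle x, Uy \rangle - J(y) \right\}, \tag{3.1}$$

where $U \in \mathbb{R}^{n \times nm}$ is given by

$$U := [I, I, \dots, I]. \tag{3.2}$$

Here $I$ is the identity matrix in $\mathbb{R}^n$. Given a pair of feasible solutions $\bar{z} = (\bar{x}, \bar{y})$ and $z = (x, y)$ of (3.1), we define
the primal-dual gap function $Q(\bar{z}, z)$ by

$$Q(\bar{z}, z) := \left[ h(\bar{x}) + \mu\omega(\bar{x}) + \langle \bar{x}, Uy \rangle - J(y) \right] - \left[ h(x) + \mu\omega(x) + \langle x, U\bar{y} \rangle - J(\bar{y}) \right]. \tag{3.3}$$

It is well known that $\bar{z} \in Z \equiv X \times \mathcal{Y}$ is an optimal solution of (3.1) if and only if $Q(\bar{z}, z) \le 0$ for any $z \in Z$.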

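Since $J_i$, $i = 1, \dots, m$, are strongly convex with modulus $\sigma_i = 1/L_i$ w.r.t. $\|\cdot\|_*$, we can define their associated
dual prox-functions and dual prox-mappings as

$$D_i(y_i^0, y_i) := J_i(y_i) - \left[ J_i(y_i^0) + \langle J_i'(y_i^0), y_i - y_i^0 \rangle \right], \tag{3.4}$$

$$M_{\mathcal{Y}_i}(-\tilde{x}, y_i^0, \tau) := \arg\min_{y_i \in \mathcal{Y}_i} \left\{ \langle -\tilde{x}, y_i \rangle + J_i(y_i) + \tau D_i(y_i^0, y_i) \right\}, \tag{3.5}$$

for any $y_i^0, y_i \in \mathcal{Y}_i$. Accordingly, we define

$$D(\tilde{y}, y) := \sum_{i=1}^m D_i(\tilde{y}_i, y_i). \tag{3.6}$$

Again, $D_i$ may not be uniquely defined since the functions $J_i$ are not necessarily differentiable. However, we will discuss how
to specify the particular selection of $J_i' \in \partial J_i$ later in this subsection.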

We are now ready to describe the randomized primal-dual method, which is obtained by properly modifying
the primal-dual gradient method as follows. Firstly, in (3.8), we only compute a randomly selected dual prox-mapping
$M_{\mathcal{Y}_{i_t}}$ rather than the dual prox-mapping $M_{\mathcal{G}}$ as in Algorithm 1. Secondly, in addition to the primal
prediction step (3.7), we add a new dual prediction step (3.9), and then use the predicted dual variable $\tilde{y}^t$ for
the computation of the new search point $x^t$ in (3.10). It can be easily seen that the RPDG method reduces to
the PDG method whenever this algorithm is directly applied to (2.7) (i.e., $m = 1$, $\mathcal{Y}_1 = \mathcal{G}$, and $J_1 = J_f$).

Similarly to the PDG method, the RPDG method can be viewed as a game iteratively performed by a buyer
and $m$ suppliers for finding the solutions (order quantities and product prices) of the saddle point problem in (3.1).
In this game, both the buyer and suppliers have access to their local cost $h(x) + \mu\omega(x)$ and $J_i(y_i)$, respectively,
as well as their interactive cost (or revenue) represented by a bilinear function $\langle x, y_i \rangle$. Also, the buyer has to
purchase the same amount of products from each supplier (e.g., for fairness). Although there are $m$ suppliers, in
each iteration only a randomly chosen supplier can make price changes according to (3.8) using the predicted
demand $\tilde{x}^t$. In order to understand the buyer's decision in (3.10), let us first denote

$$\hat{y}_i^t := M_{\mathcal{Y}_i}(-\tilde{x}^t, y_i^{t-1}, \tau_t), \quad i = 1, \dots, m; \ t = 1, \dots, k. \tag{3.11}$$
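
Algorithm 3 A randomized primal-dual gradient (RPDG) method

Let $x^0 = x^{-1} \in X$, and the nonnegative parameters $\{\tau_t\}$, $\{\eta_t\}$, and $\{\alpha_t\}$ be given.
Set $y_i^0 = \nabla f_i(x^0)$, $i = 1, \dots, m$.
for $t = 1, \dots, k$ do
    Choose $i_t$ according to $\mathrm{Prob}\{i_t = i\} = p_i$, $i = 1, \dots, m$.
    Update $z^t = (x^t, y^t)$ according to

    $$\tilde{x}^t = \alpha_t (x^{t-1} - x^{t-2}) + x^{t-1}, \tag{3.7}$$

    $$y_i^t = \begin{cases} M_{\mathcal{Y}_i}(-\tilde{x}^t, y_i^{t-1}, \tau_t), & i = i_t, \\ y_i^{t-1}, & i \ne i_t, \end{cases} \tag{3.8}$$

    $$\tilde{y}_i^t = \begin{cases} p_i^{-1}(y_i^t - y_i^{t-1}) + y_i^{t-1}, & i = i_t, \\ y_i^{t-1}, & i \ne i_t, \end{cases} \tag{3.9}$$

    $$x^t = M_X\left(\textstyle\sum_{i=1}^m \tilde{y}_i^t, \, x^{t-1}, \, \eta_t\right). \tag{3.10}$$

end for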

In other words, $\hat{y}_i^t$, $i = 1, \dots, m$, denote the prices that all the suppliers can possibly set up at iteration $t$. Then
we can see that

$$\mathbb{E}_t[\tilde{y}_i^t] = \hat{y}_i^t. \tag{3.12}$$

Indeed, we have

$$y_i^t = \begin{cases} \hat{y}_i^t, & i = i_t, \\ y_i^{t-1}, & i \ne i_t. \end{cases} \tag{3.13}$$

Hence $\mathbb{E}_t[y_i^t] = p_i \hat{y}_i^t + (1 - p_i) y_i^{t-1}$, $i = 1, \dots, m$. Using this identity in the definition of $\tilde{y}^t$ in (3.9), we obtain
(3.12). Instead of using $\sum_{i=1}^m y_i^t$ in determining his order in (3.10), the buyer notices that only one supplier has
made a change on the price, and thus uses $\sum_{i=1}^m \tilde{y}_i^t$ to predict the case when all the dual players would modify
the prices simultaneously.
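
The unbiasedness property (3.12) can be checked numerically. The following is a minimal sketch, assuming scalar dual variables and made-up values for $p_i$, $y_i^{t-1}$ and $\hat{y}_i^t$ (none of these numbers come from the paper); it enumerates all possible draws of $i_t$ exactly, using rational arithmetic, and applies the dual prediction rule (3.9).

```python
from fractions import Fraction


def expected_y_tilde(y_prev, y_hat, p, i):
    """Exact conditional expectation of y_tilde_i over the draw of i_t."""
    total = Fraction(0)
    for i_t, prob in enumerate(p):
        if i_t == i:
            # Component i is selected: y_i^t = y_hat_i, then apply (3.9).
            y_tilde = (y_hat[i] - y_prev[i]) / p[i] + y_prev[i]
        else:
            # Component i is not selected: it keeps its previous value.
            y_tilde = y_prev[i]
        total += prob * y_tilde
    return total


# Hypothetical data for illustration only.
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
y_prev = [Fraction(3), Fraction(-1), Fraction(5)]
y_hat = [Fraction(7, 2), Fraction(2), Fraction(-4)]

expectations = [expected_y_tilde(y_prev, y_hat, p, i) for i in range(len(p))]
```

Under these assumptions, `expectations` coincides with `y_hat` component by component, which is exactly the statement of (3.12).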

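In order to implement the above RPDG method, we shall explicitly specify the selection of the subgradient
$J_{i_t}'$ in the definition of the dual prox-mapping in (3.8). Denoting $\underline{x}_i^0 = x^0$, $i = 1, \dots, m$, we can easily see from
$y_i^0 = \nabla f_i(\underline{x}_i^0)$ that $\underline{x}_i^0 \in \partial J_i(y_i^0)$, $i = 1, \dots, m$. Using this relation and letting $J_i'(y_i^{t-1}) = \underline{x}_i^{t-1}$ in the definition of
$D_i(y_i^{t-1}, y_i)$ in (3.8) (see (3.4)), we then conclude from Lemma 1 (with $J_f = J_{i_t}$ and $D_f = D_{i_t}$) and (3.8) that
for any $t \ge 1$,

$$\underline{x}_{i_t}^t = (\tilde{x}^t + \tau_t \underline{x}_{i_t}^{t-1})/(1 + \tau_t), \quad \underline{x}_i^t = \underline{x}_i^{t-1}, \ \forall i \ne i_t;$$
$$y_{i_t}^t = \nabla f_{i_t}(\underline{x}_{i_t}^t), \quad y_i^t = y_i^{t-1}, \ \forall i \ne i_t.$$

Moreover, observe that the computation of $x^t$ in (3.10) requires an involved computation of $\sum_{i=1}^m \tilde{y}_i^t$. In order to
save computational time, we suggest computing this quantity in a recursive manner as follows. Let us denote
$g^t \equiv \sum_{i=1}^m y_i^t$. Clearly, in view of the fact that $y_i^t = y_i^{t-1}$, $\forall i \ne i_t$, we have

$$g^t = g^{t-1} + (y_{i_t}^t - y_{i_t}^{t-1}).$$

Also, by the definition of $g^t$ and (3.9), we have

$$\begin{aligned}
\textstyle\sum_{i=1}^m \tilde{y}_i^t &= \textstyle\sum_{i \ne i_t} y_i^{t-1} + p_{i_t}^{-1}(y_{i_t}^t - y_{i_t}^{t-1}) + y_{i_t}^{t-1} \\
&= \textstyle\sum_{i=1}^m y_i^{t-1} + p_{i_t}^{-1}(y_{i_t}^t - y_{i_t}^{t-1}) \\
&= g^{t-1} + p_{i_t}^{-1}(y_{i_t}^t - y_{i_t}^{t-1}).
\end{aligned}$$

Incorporating these two ideas mentioned above, we present an efficient implementation of the RPDG method in
Algorithm 4.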
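
Algorithm 4 An efficient implementation of the RPDG method

Let $x^0 = x^{-1} \in X$, and nonnegative parameters $\{\alpha_t\}$, $\{\tau_t\}$, and $\{\eta_t\}$ be given.
Set $\underline{x}_i^0 = x^0$, $y_i^0 = \nabla f_i(x^0)$, $i = 1, \dots, m$, and $g^0 = \sum_{i=1}^m y_i^0$.
for $t = 1, \dots, k$ do
    Choose $i_t$ according to $\mathrm{Prob}\{i_t = i\} = p_i$, $i = 1, \dots, m$.
    Update $z^t := (x^t, y^t)$ by

    $$\tilde{x}^t = \alpha_t (x^{t-1} - x^{t-2}) + x^{t-1}, \tag{3.14}$$

    $$\underline{x}_i^t = \begin{cases} (1 + \tau_t)^{-1} (\tilde{x}^t + \tau_t \underline{x}_i^{t-1}), & i = i_t, \\ \underline{x}_i^{t-1}, & i \ne i_t, \end{cases} \tag{3.15}$$

    $$y_i^t = \begin{cases} \nabla f_i(\underline{x}_i^t), & i = i_t, \\ y_i^{t-1}, & i \ne i_t, \end{cases} \tag{3.16}$$

    $$x^t = M_X\left(g^{t-1} + p_{i_t}^{-1}(y_{i_t}^t - y_{i_t}^{t-1}), \, x^{t-1}, \, \eta_t\right), \tag{3.17}$$

    $$g^t = g^{t-1} + y_{i_t}^t - y_{i_t}^{t-1}. \tag{3.18}$$

end for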

Clearly, the RPDG method is an incremental gradient type method since each iteration of this algorithm
involves the computation of the gradient $\nabla f_{i_t}$ of only one component function. As shown in the following subsection,
such a randomization scheme can lead to significant savings on the total number of gradient evaluations,
at the expense of more primal prox-mappings.
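
To make the recursion concrete, the following is a minimal sketch of the efficient RPDG scheme on a one-dimensional toy problem. Everything specific to the example is an assumption made for illustration: the components are $f_i(x) = \tfrac{1}{2} a_i (x - b_i)^2$ (so that $L_i = a_i$), $X = \mathbb{R}$, $h = 0$, $\omega(x) = \tfrac{1}{2}x^2$, the sampling is uniform, the constant parameters follow the uniform-sampling choice with $C = 4m \max_i L_i/\mu$, and the primal prox-mapping is taken to be $\arg\min_x \{ g x + \tfrac{\mu}{2} x^2 + \tfrac{\eta}{2}(x - x^{t-1})^2 \}$, which has a closed form. The function name `rpdg_scalar` and the data are hypothetical.

```python
import math
import random


def rpdg_scalar(a, b, mu, num_iters, seed=0):
    """Sketch of the efficient RPDG recursion for
    min_x sum_i 0.5*a[i]*(x - b[i])**2 + 0.5*mu*x**2 over the real line."""
    m = len(a)
    p = 1.0 / m  # uniform sampling probabilities (assumption)

    # Constant parameters tau, eta, alpha for uniform sampling, with L_i = a[i].
    C = 4.0 * m * max(a) / mu
    root = math.sqrt((m - 1) ** 2 + 4.0 * m * C)
    tau = (root - (m - 1)) / (2.0 * m)
    eta = mu * (root + (m - 1)) / 2.0
    alpha = 1.0 - 2.0 / ((m + 1) + root)

    def grad(i, x):
        return a[i] * (x - b[i])

    rng = random.Random(seed)
    x_prev2 = x_prev = 0.0              # x^{-1} = x^0
    x_under = [x_prev] * m              # points where each gradient was last evaluated
    y = [grad(i, x_prev) for i in range(m)]
    g = sum(y)                          # running sum of the dual variables

    for _ in range(num_iters):
        i = rng.randrange(m)
        x_tilde = alpha * (x_prev - x_prev2) + x_prev             # primal prediction
        x_under[i] = (x_tilde + tau * x_under[i]) / (1.0 + tau)   # gradient evaluation point
        y_new = grad(i, x_under[i])                               # one component gradient
        g_tilde = g + (y_new - y[i]) / p                          # sum of predicted duals
        # Closed-form primal prox-mapping for the Euclidean toy setting.
        x_new = (eta * x_prev - g_tilde) / (mu + eta)
        g += y_new - y[i]                                         # recursive update of g
        y[i] = y_new
        x_prev2, x_prev = x_prev, x_new
    return x_prev


# Hypothetical problem data.
a = [1.0, 2.0, 0.5, 1.5]
b = [1.0, -2.0, 3.0, 0.5]
mu = 1.0
x_star = sum(ai * bi for ai, bi in zip(a, b)) / (sum(a) + mu)  # closed-form minimizer
x_rpdg = rpdg_scalar(a, b, mu, num_iters=3000)
```

Each pass through the loop evaluates a single component gradient and performs one primal prox-mapping, while the full sum of dual variables is maintained through the scalar `g` rather than recomputed.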

It should also be noted that due to the randomness in the RPDG method, we cannot guarantee that $\underline{x}_i^t \in X$
for all $i = 1, \dots, m$, and $t \ge 1$ in general, even though we do have all the iterates $x^t \in X$. That is why we need to
make the assumption that the functions $f_i$ are differentiable over $\mathbb{R}^n$ for the RPDG method.


3.2 The convergence of the RPDG algorithm 


Our goal in this subsection is to describe the convergence properties of the RPDG method for the strongly
convex case when $\mu > 0$. Generalization of the RPDG method for the non-strongly convex case will be discussed
in Section 4.

Theorem 2 below states some general convergence properties of RPDG. Similar to the PDG method, we provide
bounds on $\mathbb{E}[P(x^k, x^*)]$ and $\mathbb{E}[\Psi(\bar{x}^k) - \Psi(x^*)]$. However, we cannot provide a bound on the expected primal-dual
gap $\mathbb{E}[\mathrm{gap}(\bar{z}^k)]$ even though our analysis for the RPDG algorithm still relies on the primal-dual gap function $Q$
in (3.3) (see the cited literature for some relevant discussions).


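Theorem 2 Suppose that $\{\tau_t\}$, $\{\eta_t\}$, and $\{\alpha_t\}$ in the RPDG method are set to

$$\tau_t = \tau, \quad \eta_t = \eta, \quad \text{and} \quad \alpha_t = \alpha, \tag{3.19}$$

for any $t \ge 1$ such that

$$(1 - \alpha)(1 + \tau) \le p_i, \quad i = 1, \dots, m, \tag{3.20}$$

$$\eta \le \alpha(\mu + \eta), \tag{3.21}$$

$$\eta \tau p_i \ge 4 L_i, \quad i = 1, \dots, m, \tag{3.22}$$

for some $\alpha \in (0, 1)$. Then, for any $k \ge 1$, we have

$$\mathbb{E}[P(x^k, x^*)] \le \left(1 + \frac{L_f \alpha}{(1 - \alpha)\eta}\right) \alpha^k P(x^0, x^*), \tag{3.23}$$

$$\mathbb{E}[\Psi(\bar{x}^k) - \Psi(x^*)] \le \alpha^{k/2} \left(\alpha^{-1}\eta + \frac{(3 - 2\alpha) L_f}{1 - \alpha} + \frac{2 L_f^2 \alpha}{(1 - \alpha)\eta}\right) P(x^0, x^*), \tag{3.24}$$

where $\bar{x}^k = \left(\sum_{t=1}^k \theta_t\right)^{-1} \sum_{t=1}^k \theta_t x^t$ with $\{\theta_t\}$ defined as in (2.24), $x^*$ denotes the optimal solution of problem
(1.1), and the expectation is taken w.r.t. $i_1, \dots, i_k$.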

We now provide a few specific selections of $p_i$, $\tau$, $\eta$, and $\alpha$ satisfying (3.20)-(3.22) and establish the complexity
of the RPDG method for computing a stochastic $\epsilon$-solution of problem (1.1), i.e., a point $\bar{x} \in X$ s.t. $\mathbb{E}[P(\bar{x}, x^*)] \le \epsilon$,
as well as a stochastic $(\epsilon, \lambda)$-solution of problem (1.1), i.e., a point $\bar{x} \in X$ s.t. $\mathrm{Prob}\{P(\bar{x}, x^*) \le \epsilon\} \ge 1 - \lambda$ for some
$\lambda \in (0, 1)$. Moreover, in view of (3.24), similar complexity bounds of the RPDG method can be established in
terms of the primal optimality gap, i.e., $\mathbb{E}[\Psi(\bar{x}) - \Psi^*]$.

The following corollary shows the convergence of RPDG under a non-uniform distribution for the random
variables $i_t$, $t = 1, \dots, k$.


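Corollary 1 Suppose that $\{i_t\}$ in the RPDG method are distributed over $\{1, \dots, m\}$ according to

$$p_i = \mathrm{Prob}\{i_t = i\} = \frac{1}{2m} + \frac{L_i}{2L}, \quad i = 1, \dots, m. \tag{3.25}$$

Also assume that $\{\tau_t\}$, $\{\eta_t\}$, and $\{\alpha_t\}$ are set to (3.19) with

$$\tau = \frac{\sqrt{(m-1)^2 + 4mC} - (m-1)}{2m}, \quad \eta = \frac{\mu\sqrt{(m-1)^2 + 4mC} + \mu(m-1)}{2}, \quad \text{and} \quad \alpha = 1 - \frac{1}{(m+1) + \sqrt{(m-1)^2 + 4mC}}, \tag{3.26}$$

where

$$C = \frac{8L}{\mu}. \tag{3.27}$$

Then for any $k \ge 1$, we have

$$\mathbb{E}[P(x^k, x^*)] \le \left(1 + \frac{3L_f}{\mu}\right) \alpha^k P(x^0, x^*), \tag{3.28}$$

$$\mathbb{E}[\Psi(\bar{x}^k) - \Psi^*] \le \alpha^{k/2} (1 - \alpha)^{-1} \left[\mu + 2L_f + \frac{L_f^2}{\mu}\right] P(x^0, x^*). \tag{3.29}$$

As a consequence, the number of iterations performed by the RPDG method to find a stochastic $\epsilon$-solution and a
stochastic $(\epsilon, \lambda)$-solution of (1.1), in terms of the distance to the optimal solution, i.e., $\mathbb{E}[P(x^k, x^*)]$, can be bounded
by $K(\epsilon, C)$ and $K(\lambda\epsilon, C)$, respectively, where

$$K(\epsilon, C) := \left[(m + 1) + \sqrt{(m-1)^2 + 4mC}\right] \log\left[\left(1 + \frac{3L_f}{\mu}\right) \frac{P(x^0, x^*)}{\epsilon}\right]. \tag{3.30}$$

Similarly, the total number of iterations performed by the RPDG method to find a stochastic $\epsilon$-solution and a stochastic
$(\epsilon, \lambda)$-solution of (1.1), in terms of the primal optimality gap, i.e., $\mathbb{E}[\Psi(\bar{x}^k) - \Psi^*]$, can be bounded by $\bar{K}(\epsilon, C)$ and
$\bar{K}(\lambda\epsilon, C)$, respectively, where

$$\bar{K}(\epsilon, C) := 2\left[(m + 1) + \sqrt{(m-1)^2 + 4mC}\right] \log\left[2\left(\mu + 2L_f + \frac{L_f^2}{\mu}\right)(m + \sqrt{mC}) \frac{P(x^0, x^*)}{\epsilon}\right]. \tag{3.31}$$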


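Proof. It follows from (3.26) that

$$(1 - \alpha)(1 + \tau) = 1/(2m) \le p_i, \quad (1 - \alpha)\eta = (\alpha - 1/2)\mu \le \alpha\mu, \quad \text{and} \quad \eta\tau p_i = \mu C p_i \ge 4L_i,$$

and hence that the conditions in (3.20)-(3.22) are satisfied. Notice that by the fact that $\alpha \ge 3/4$, $\forall m \ge 1$, and
(3.26), we have

$$1 + \frac{L_f \alpha}{(1 - \alpha)\eta} = 1 + \frac{L_f \alpha}{(\alpha - 1/2)\mu} \le 1 + \frac{3L_f}{\mu}.$$

Using the above bound in (3.23), we obtain (3.28). It follows from the facts $(1 - \alpha)\eta \le \alpha\mu$, $1/2 \le \alpha \le 1$, $\forall m \ge 1$,
and $\eta \ge \mu\sqrt{C} \ge 2\mu$ that

$$\alpha^{-1}\eta + \frac{(3 - 2\alpha)L_f}{1 - \alpha} + \frac{2L_f^2 \alpha}{(1 - \alpha)\eta} \le (1 - \alpha)^{-1}\left(\mu + 2L_f + \frac{L_f^2}{\mu}\right).$$

Using the above bound in (3.24), we obtain (3.29). Denoting $D \equiv (1 + \frac{3L_f}{\mu}) P(x^0, x^*)$, we conclude from (3.28)
and the fact that $\log x \le x - 1$ for any $x \in (0, 1)$ that

$$\mathbb{E}[P(x^{K(\epsilon, C)}, x^*)] \le D\alpha^{K(\epsilon, C)} \le D\alpha^{\frac{\log(D/\epsilon)}{1 - \alpha}} \le D\alpha^{\frac{\log(D/\epsilon)}{-\log\alpha}} \le D\alpha^{\frac{\log(\epsilon/D)}{\log\alpha}} = \epsilon.$$

Moreover, by Markov's inequality, (3.28) and the fact that $\log x \le x - 1$ for any $x \in (0, 1)$, we have

$$\mathrm{Prob}\{P(x^{K(\lambda\epsilon, C)}, x^*) > \epsilon\} \le \frac{1}{\epsilon}\mathbb{E}[P(x^{K(\lambda\epsilon, C)}, x^*)] \le \frac{D}{\epsilon}\alpha^{\frac{\log(\lambda\epsilon/D)}{\log\alpha}} = \lambda.$$

The proofs for the complexity bounds in terms of the primal optimality gap are similar and hence the details are
skipped. ■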
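
The algebra in the first step of this proof is easy to check numerically. The sketch below, using hypothetical Lipschitz constants and a hypothetical modulus $\mu$, computes the non-uniform sampling probabilities and the constants $\tau$, $\eta$, $\alpha$ with $C = 8L/\mu$ and $L = \sum_i L_i$ (the value of $L$ for which the probabilities sum to one), and then tests the three conditions (3.20)-(3.22) directly. The helper name `corollary1_parameters` is introduced here only for illustration.

```python
import math


def corollary1_parameters(L_list, mu):
    """Sampling probabilities and constant stepsizes for non-uniform sampling."""
    m = len(L_list)
    L = sum(L_list)                      # assumed: L is the sum of the L_i
    C = 8.0 * L / mu
    root = math.sqrt((m - 1) ** 2 + 4.0 * m * C)
    p = [1.0 / (2 * m) + L_i / (2.0 * L) for L_i in L_list]
    tau = (root - (m - 1)) / (2.0 * m)
    eta = mu * (root + (m - 1)) / 2.0
    alpha = 1.0 - 1.0 / ((m + 1) + root)
    return p, tau, eta, alpha


# Hypothetical constants, for illustration only.
L_list = [1.0, 4.0, 0.5, 2.5]
mu = 0.1
m = len(L_list)
p, tau, eta, alpha = corollary1_parameters(L_list, mu)

# Conditions of the general convergence theorem, with small floating-point slack.
cond_320 = all((1 - alpha) * (1 + tau) <= p_i + 1e-12 for p_i in p)
cond_321 = eta <= alpha * (mu + eta) + 1e-12
cond_322 = all(eta * tau * p_i >= 4 * L_i - 1e-9 for p_i, L_i in zip(p, L_list))
```

With these choices the product $(1 - \alpha)(1 + \tau)$ equals $1/(2m)$ and $\eta\tau$ equals $\mu C$, up to rounding, which is what makes all three conditions hold simultaneously.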

The non-uniform distribution in (3.25) requires the estimation of the Lipschitz constants $L_i$, $i = 1, \dots, m$. In
case such information is not available, we can use a uniform distribution for $i_t$, and as a result, the complexity
bounds will depend on a larger condition number given by $m \max_{i=1,\dots,m} L_i/\mu$. However, if we do have $L_1 =
L_2 = \cdots = L_m$, then the results obtained by using a uniform distribution are slightly sharper than those obtained by
using a non-uniform distribution in Corollary 1.


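Corollary 2 Suppose that $\{i_t\}$ in the RPDG method are uniformly distributed over $\{1, \dots, m\}$ according to

$$p_i = \mathrm{Prob}\{i_t = i\} = \frac{1}{m}, \quad i = 1, \dots, m. \tag{3.32}$$

Also assume that $\{\tau_t\}$, $\{\eta_t\}$, and $\{\alpha_t\}$ are set to (3.19) with

$$\tau = \frac{\sqrt{(m-1)^2 + 4mC} - (m-1)}{2m}, \quad \eta = \frac{\mu\sqrt{(m-1)^2 + 4mC} + \mu(m-1)}{2}, \quad \text{and} \quad \alpha = 1 - \frac{2}{(m+1) + \sqrt{(m-1)^2 + 4mC}}, \tag{3.33}$$

where

$$C := \frac{4m}{\mu} \max_{i=1,\dots,m} L_i. \tag{3.34}$$

Then we have

$$\mathbb{E}[P(x^k, x^*)] \le \left(1 + \frac{L_f}{\mu}\right) \alpha^k P(x^0, x^*), \tag{3.35}$$

$$\mathbb{E}[\Psi(\bar{x}^k) - \Psi^*] \le \alpha^{k/2} (1 - \alpha)^{-1} \left[\mu + 2L_f + \frac{L_f^2}{\mu}\right] P(x^0, x^*), \tag{3.36}$$

for any $k \ge 1$. As a consequence, the number of iterations performed by the RPDG method to find a stochastic $\epsilon$-solution
and a stochastic $(\epsilon, \lambda)$-solution of (1.1), in terms of the distance to the optimal solution, i.e., $\mathbb{E}[P(x^k, x^*)]$,
can be bounded by $K_u(\epsilon, C)$ and $K_u(\lambda\epsilon, C)$, respectively, where

$$K_u(\epsilon, C) := \frac{(m + 1) + \sqrt{(m-1)^2 + 4mC}}{2} \log\left[\left(1 + \frac{L_f}{\mu}\right) \frac{P(x^0, x^*)}{\epsilon}\right].$$

Similarly, the total number of iterations performed by the RPDG method to find a stochastic $\epsilon$-solution and a stochastic
$(\epsilon, \lambda)$-solution of (1.1), in terms of the primal optimality gap, i.e., $\mathbb{E}[\Psi(\bar{x}^k) - \Psi^*]$, can be bounded by $\bar{K}(\epsilon, C)/2$ and
$\bar{K}(\lambda\epsilon, C)/2$, respectively, where $\bar{K}(\epsilon, C)$ is defined in (3.31).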

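Proof. It follows from (3.33) that

$$(1 - \alpha)(1 + \tau) = 1/m = p_i, \quad (1 - \alpha)\eta - \alpha\mu = 0, \quad \text{and} \quad \eta\tau = \mu C \ge 4mL_i,$$

and hence that the conditions in (3.20)-(3.22) are satisfied. By the identity $(1 - \alpha)\eta = \alpha\mu$, we have

$$1 + \frac{L_f \alpha}{(1 - \alpha)\eta} = 1 + \frac{L_f}{\mu}.$$

Using the above bound in (3.23), we obtain (3.35). Moreover, noting that $\eta \ge \mu\sqrt{C} \ge 2\mu$ and $2/3 \le \alpha \le 1$, $\forall m \ge 1$,
we have

$$\alpha^{-1}\eta + \frac{(3 - 2\alpha)L_f}{1 - \alpha} + \frac{2L_f^2 \alpha}{(1 - \alpha)\eta} \le (1 - \alpha)^{-1}\left(\mu + 2L_f + \frac{L_f^2}{\mu}\right).$$

Using the above bound in (3.24), we obtain (3.36). The proofs for the complexity bounds are similar to those in
Corollary 1 and hence the details are skipped. ■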


Comparing the complexity bounds obtained from Corollaries 1 and 2 with those of any optimal deterministic
first-order method, we see that they differ by a factor of $O(\sqrt{mL_f/L})$ whenever $\sqrt{mC}\log(1/\epsilon)$ is dominating in (3.30). Clearly,
when $L_f$ and $L$ are in the same order of magnitude, RPDG can save up to a factor of $O(\sqrt{m})$ gradient evaluations for the
component functions $f_i$ relative to the deterministic first-order methods. However, it should be pointed out that $L_f$
can be much smaller than $L$. In particular, when $L_f = L_i$, $i = 1, \dots, m$, we have $L_f = L/m$. In the next subsection,
we will construct examples in such extreme cases to obtain the lower complexity bound for general randomized
incremental gradient methods.

3.3 Lower complexity bound for randomized methods


Our goal in this subsection is to demonstrate that the complexity bounds obtained in Theorem 2 and Corollaries
1 and 2 for the RPDG method are essentially not improvable. Observe that although there exist rich lower
complexity bounds in the literature for deterministic first-order methods, the study of lower complexity
bounds for randomized methods is still quite limited. Recently Agarwal and Bottou [1] suggested a lower
complexity bound for minimizing the finite-sum convex optimization problem given in the form of (1.1). However,
their bounds are developed for deterministic algorithms and hence are not applicable to randomized incremental
gradient methods.

To derive the performance limit of the incremental gradient methods, we consider a special class of unconstrained
and separable strongly convex optimization problems given in the form of

$$\Psi^* := \min_{x_i \in \mathbb{R}^{\tilde{n}},\, i = 1, \dots, m} \left\{ \Psi(x) := \sum_{i=1}^m \left[ f_i(x_i) + \frac{\mu}{2}\|x_i\|_2^2 \right] \right\}. \tag{3.37}$$


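Here $\tilde{n} \equiv n/m \in \{1, 2, \dots\}$ and $\|\cdot\|_2$ denotes the standard Euclidean norm. To fix the notation, we also denote
$x = (x_1, \dots, x_m)$. Moreover, we assume that the $f_i$'s are quadratic functions given by

$$f_i(x_i) = \frac{\mu(Q - 1)}{4} \left[ \frac{1}{2}\langle A x_i, x_i \rangle - \langle e_1, x_i \rangle \right], \tag{3.38}$$

where $e_1 := (1, 0, \dots, 0)$ and $A$ is a symmetric matrix in $\mathbb{R}^{\tilde{n} \times \tilde{n}}$ given by

$$A = \begin{pmatrix}
2 & -1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & \cdots & 0 & 0 & 0 \\
0 & -1 & 2 & -1 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & -1 & 2 & -1 \\
0 & 0 & 0 & 0 & \cdots & 0 & -1 & \kappa
\end{pmatrix} \quad \text{with} \quad \kappa = \frac{\sqrt{Q} + 3}{\sqrt{Q} + 1}. \tag{3.39}$$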





Compared with the classic worst-case example given in [28], the tridiagonal matrix $A$ above has a different
last diagonal element $\kappa$ (instead of 2). This modification allows us to study problems of finite dimension more
conveniently. It can be easily checked that $A \succeq 0$ and that its maximum eigenvalue does not exceed 4. Indeed, for
any $s = (s_1, \dots, s_{\tilde{n}}) \in \mathbb{R}^{\tilde{n}}$, we have

$$\begin{aligned}
\langle As, s \rangle &= s_1^2 + \textstyle\sum_{i=1}^{\tilde{n}-1}(s_i - s_{i+1})^2 + (\kappa - 1)s_{\tilde{n}}^2 \ge 0, \\
\langle As, s \rangle &\le s_1^2 + \textstyle\sum_{i=1}^{\tilde{n}-1} 2(s_i^2 + s_{i+1}^2) + (\kappa - 1)s_{\tilde{n}}^2 \\
&= 3s_1^2 + 4\textstyle\sum_{i=2}^{\tilde{n}-1} s_i^2 + (\kappa + 1)s_{\tilde{n}}^2 \le 4\|s\|_2^2,
\end{aligned}$$

where the last inequality follows from the fact that $\kappa \le 3$. Therefore, for any $Q > 1$, the component functions
$f_i$ in (3.38) are convex and their gradients are Lipschitz continuous with constant bounded by $L_i = \mu(Q - 1)$,
$i = 1, \dots, m$.
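
The two eigenvalue bounds can also be observed numerically. The following is a small sketch that builds the tridiagonal matrix with last diagonal entry $\kappa = (\sqrt{Q} + 3)/(\sqrt{Q} + 1)$ and evaluates the Rayleigh quotient $\langle As, s \rangle / \|s\|_2^2$ on random vectors. The dimension, the value of $Q$, the random seed and the helper names are arbitrary choices made for this illustration.

```python
import math
import random


def worst_case_matrix(n_tilde, Q):
    """Tridiagonal matrix with 2 on the diagonal, -1 off the diagonal,
    and kappa in the last diagonal position."""
    kappa = (math.sqrt(Q) + 3.0) / (math.sqrt(Q) + 1.0)
    A = [[0.0] * n_tilde for _ in range(n_tilde)]
    for i in range(n_tilde):
        A[i][i] = 2.0
        if i + 1 < n_tilde:
            A[i][i + 1] = -1.0
            A[i + 1][i] = -1.0
    A[-1][-1] = kappa
    return A


def quad_form(A, s):
    """Compute <A s, s> for a dense matrix stored as a list of rows."""
    n = len(s)
    return sum(s[i] * A[i][j] * s[j] for i in range(n) for j in range(n))


rng = random.Random(1)
A = worst_case_matrix(8, Q=25.0)
ratios = []
for _ in range(200):
    s = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    ratios.append(quad_form(A, s) / sum(v * v for v in s))
```

Every entry of `ratios` should lie in the interval $[0, 4]$, in line with $A \succeq 0$ and the bound on the maximum eigenvalue; random sampling of course only illustrates the inequalities and does not prove them.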

We consider a general class of randomized incremental gradient methods which sequentially acquire the
gradients of a randomly selected component function $f_{i_t}$ at iteration $t$. More specifically, we assume that the
independent random variables $i_t$, $t = 1, 2, \dots$, satisfy

$$\mathrm{Prob}\{i_t = i\} = p_i \quad \text{and} \quad p_i > 0, \quad i = 1, \dots, m. \tag{3.40}$$

Similar to [28], we assume that these methods generate a sequence of test points $\{x^k\}$ such that

$$x^k \in x^0 + \mathrm{Lin}\{\nabla f_{i_1}(x^0), \dots, \nabla f_{i_k}(x^{k-1})\}, \tag{3.41}$$

where Lin denotes the linear span.

Theorem 3 below describes the performance limit of the above randomized incremental gradient methods for
solving (3.37).

Theorem 3 Let $x^*$ be the optimal solution of problem (3.37) and denote

$$q := \frac{\sqrt{Q} - 1}{\sqrt{Q} + 1}. \tag{3.42}$$

Then the iterates $\{x^k\}$ generated by any randomized incremental gradient method must satisfy

$$\frac{\mathbb{E}[\|x^k - x^*\|_2^2]}{\|x^0 - x^*\|_2^2} \ge \frac{1}{2}\exp\left(-\frac{4k\sqrt{Q}}{m(\sqrt{Q} + 1)^2 - 4\sqrt{Q}}\right) \tag{3.43}$$

for any

$$n \ge \underline{n}(m, k) \equiv \frac{m \log\left[\left(1 - (1 - q^2)/m\right)^k / 2\right]}{2\log q}. \tag{3.44}$$

As an immediate consequence of Theorem 3, we obtain a lower complexity bound for randomized incremental
gradient methods.


Corollary 3 The number of gradient evaluations performed by any randomized incremental gradient methods for find¬ 
ing a solution x € X of problem El such that E[||a; — a;* Hi] < e cannot be smaller than 

n I (^VmC + log — —A—lk | 

if n is sufficiently large, where C = Ljfj, and L = 


Proof. It follows from (I3.43|l that the number of iterations k required by any randomized incremental gradient 
methods to hnd an approximate solution x must satisfy 


k > 


I m(\/C+l)^ 
V 4\/C 


- 1 


) log- 


> 


2e — 2 I 2 


m ( 


+ - ij log ■ 


2e 


(3.45) 


Noting that for the worst-case instance in (I3.37L we have Li = fi{Q — 1), i = and hence that L = 

~ !)• Using this relation, we conclude that 


k > 


1 { \/mC+m'^ 

2 2 


+ m 


)-] 


log¬ 


it. 


The above bound holds when n > n(m, fc). 


In view of Theorem 3, we can also derive a lower complexity bound for randomized block coordinate descent
methods, which update one randomly selected block of variables at each iteration for solving $\min_{x \in X} \Psi(x)$. Here $\Psi$ is
smooth and strongly convex such that

$$\frac{\mu_\Psi}{2}\|x - y\|_2^2 \le \Psi(x) - \Psi(y) - \langle \nabla\Psi(y), x - y \rangle \le \frac{L_\Psi}{2}\|x - y\|_2^2, \quad \forall x, y \in X.$$


Corollary 4 The number of iterations performed by any randomized block coordinate descent methods for finding a 
solution X G X o/min^jgx 'L{x) such that E[||x — x*]]!] < c cannot be smaller than 

L2 I (rn^/^^ log | 

if n is sufficiently large, where Qip = L^ /pq, denotes the condition number of T. 

Proof. The worst-case instances in (3.37) have a block separable structure. Therefore, any randomized
incremental gradient methods are equivalent to randomized block coordinate descent methods. The result then
immediately follows from (3.45). ■


4 Generalization of randomized primal-dual gradient methods 

In this section, we generalize the RPDG method for solving a few different types of convex optimization problems 
which are not necessarily smooth and strongly convex. 

4.1 Smooth problems with bounded feasible sets

Our goal in this subsection is to generalize RPDG for solving smooth problems without strong convexity (i.e.,
$\mu = 0$). Different from the deterministic PDG method, it is difficult to develop a simple stepsize policy for $\{\tau_t\}$,
$\{\eta_t\}$, and $\{\alpha_t\}$ which can guarantee the convergence of this method unless a weaker termination criterion is
used (see the related discussions in the literature). In order to obtain stronger convergence results, we will discuss a different approach obtained by
applying the RPDG method to a slightly perturbed problem of (1.1).

In order to apply this perturbation approach, we will assume that $X$ is bounded (see Subsection 4.3 for
possible extensions), i.e., given $x_0 \in X$, $\exists\, \Omega_X \ge 0$ s.t.

ina.xPui{xQ,x) < f2x. (4-1) 

x^X 


Now we define the perturbation problem as 

'Pg := uiin{'Ps{x) := f{x) + h{x) + 5Pu,{xo,x)} , (4.2) 

xGX 

for some fixed <5 > 0. It is well-known that an approximate solution of (14.21) will also be an approximate solution 
of (11.11) if S is sufficiently small. More specihcally, it is easy to verify that 

<P* <Ps <P* +5f2x, (4.3) 

>P{x) <Ps{x) <'P{x) + SQx, yxeX. (4.4) 


The following result describes the complexity associated with this perturbation approach for solving smooth
problems without strong convexity (i.e., $\mu = 0$).


Proposition 1 Let us apply the RPDG method with the parameter settings in Corollary\T\to the perturbation problem 
{4-^ with 


5 = 


2f2 


for some e > 0. Then we can find a solution x £ X s.t. E['f'(a;) — <P*] < e in at most 


O 


I log I 


iterations. Moreover, we can find a solution x £ X s.t. Prob{>f'(a;) — >P'* > e} < A for any A € (0,1) in at most 


O 


I log filial 


(4.5) 


(4.6) 


(4.7) 


iterations. 


Proof. Let Xg be the optimal solution of (14.21) . Denote C := IQLDx/e and 

K -.= 2 ^{m + 1) -(- •\/(m — 1)2 -p dmcj log (m VmC){S + 2Lf + -^)^^ 


It can be easily seen that 

T{x^) -T* < Tg{x^) - Tg + sn% = + i 

Note that problem (Ha is given in the form of dni with the strongly convex modulus p, = 5, and h}x) = 
h{x) — S(uj'(xo),x}. Hence by applying Corollary[Tl we have 

El'Pg(x^)-Tg*]<§. 
Combining these two inequalities, we have $\mathbb{E}[\Psi(\bar{x}^K) - \Psi^*] \le \epsilon$, which implies the bound in (4.6). The bound in
(4.7) can be shown similarly and hence the details are skipped. ■

Observe that if we apply a deterministic optimal first-order method (e.g., Nesterov’s method or the PDG 
method), the total number of gradient evaluations for V/i, i = 1,..., m, would be given by 

! Lfn\ 
my ’^ ^ . 

Comparing this bound with (gSl), we can see that the number of gradient evaluations performed by the RPDG 
method can be O {^y/rn\og~^{mLffixh)) times smaller than these deterministic methods when L and Lf are in 
the same order of magnitude. 


4.2 Structured nonsmooth problems 


In this subsection, we assume that the components $f_i$ are nonsmooth but can be approximated closely
by smooth ones. More specifically, we assume that


fi{x) := raSiK{AiX,yi) - qi{vi). 

Vi&Yi 

Nesterov in an important work |29| shows that we can approximate fi{x) and /, respectively, by 
fi{x,5) := iaax{AiX,yi) - qi[yi) - Svi{yi) and f{x,S) 

Vi&Yi 


(4.8) 


(4.9) 

(4.10) 

(4.11) 

for any x G X, where fly = - Moreover, fi{-,S) and f{-,5) are continuously differentiable and their 


where Vi{yi) is a strongly convex function with modulus 1 such that 

Q < Vi{yi) < fivi, 

In particular, we can easily show that 

fiix,5) < fi{x) < fi{x,S) + SOy, and /(*, 5) </(*)</(*, 5) + 5fly, 

= I:T=i^y 

gradients are Lipschitz continuous with constants given by 


T, = 


P*ll 


and L = 


f _ _ IIAII^ 


s " ^ 

respectively. As a consequence, we can apply the RPDG method to solve the approximation problem 

'i's ■= min := f{x,S) -|- h{x) + /ica(a;)} . 


(4.12) 


(4.13) 


The following result provides complexity bounds of the RPDG method for solving the above structured 
nonsmooth problems for the case when y > 0. 

Proposition 2 Let us apply the RPDG method with the parameter settings in Corollary[^to the approximation problem 
with 

5 = 


201 


for some e > 0. Then we can find a solution x G X s.t. E[!f"(a:) — T*] < e in at most 


o{\\A\\nY 


. Tn\\A\\nxOY 

fie 


iterations. Moreover, we can find a solution x G X s.t. Prob{!f'(a;) — T* > e} < X for any \ G (0,1) in at most 


(4.14) 

(4.15) 

(4.16) 


iterations. 
Proof. It follows from (4.11) and (4.13) that


(4.17) 


Using relation (lil^ and Corollaries [TJ we conclude that a solution £ X satisfying E[^'5(a:^) — 'Pg] < tl2 can 
be found in 

o[\\A\\Qy^^\og 




iterations. This observation together with (14.1711 and the definition of L in (14.1211 then imply the bound in (14.1511 . 
The bound in (14.1611 follows similarly from (14.1711 and Corollaries [TJ and hence the details are skipped. ■ 

The following result holds for the RPDG method applied to the above structured nonsmooth problems when 

/i = 0. 


Proposition 3 Let us apply the RPDG method with the parameter settings in CoroUary\J\to the approximation problem 
with S in {4-14^ for some e > 0. Then we can find a solution x £ X s.t. E[!?"(a:) — < e in at most 

Q I v/m||A||nxtly 7Tt||A||i?x | 

iterations. Moreover, we can find a solution x £ X s.t. Prob{>f'(a;) — T* > e} < X for any \ £ (0,1) in at most 

Q I v/rra||A||nx j^g m||A||J7xt?y | 

iterations. 

Proof. Similarly to the arguments used in the proof of Proposition |2j our results follow from (14.1711 . and an 
application of Proposition [1] to problem (14.1311 . ■ 

By Propositions 2 and 3, the total number of gradient computations for $f(\cdot, \delta)$ performed by the RPDG
method, after disregarding the logarithmic factors, can be $O(\sqrt{m})$ times smaller than those required by deterministic
first-order methods, such as Nesterov's smoothing technique [29].


4.3 Unconstrained smooth problems 


In this subsection, we set $X = \mathbb{R}^n$, $h(x) = 0$, and $\mu = 0$ in (1.1) and consider the basic convex programming
problem of

$$f^* := \min_{x \in \mathbb{R}^n} \left\{ f(x) := \textstyle\sum_{i=1}^m f_i(x) \right\}. \tag{4.18}$$

We assume that the set of optimal solutions $X^*$ of this problem is nonempty.

We will still use the perturbation-based approach as described in Subsection 14. 1 1 bv solving the perturbation 
problem given by 

fs := min |/5(a;) :=/(a;)-h|||a;-a;°||l,| (4.19) 

for some x'^ £ X, 6 > 0, where || • ||2 denotes the Euclidean norm. Also let Lg denote the Lipschitz constant for 
fs{x). Clearly, Lg = L + 6. Since the problem is unconstrained and the information on the size of the optimal 
solution is unavailable, it is hard to estimate the total number of iterations by using the absolute accuracy in 
terms of E[f{x) — /*]. Instead, we define the relative accuracy associated with a given x £ X by 


Rac{x,x°,f*) := 


L(l+min„gx* \\x°-u\\l)' 


(4.20) 


We are now ready to establish the complexity of the RPDG method applied to (14.181) in terms of Rac{x, x^, f*). 

Proposition 4 Let us apply the RPDG method with the parameter settings in Corollary 1 to the perturbation problem
(4.19) with

$$\delta = \frac{L\epsilon}{2}, \tag{4.21}$$

for some $\epsilon > 0$. Then we can find a solution $\bar{x} \in X$ s.t. $\mathbb{E}[R_{ac}(\bar{x}, x^0, f^*)] \le \epsilon$ in at most

o{v/^ log’ll (4.22) 

iterations. Moreover, we can find a solution x G X s.t. Prob{7?ac(2;, , f*) > e} < X for any X G (0,1) in at most 

C?{/¥log£} (4-23) 

iterations. 

Proof. Let be the optimal solution of (14.1911 . Also let x* be the optimal solution of (14.181) that is closest to 
x^, i.e., X* = argmin„gj)f» ||x° — u\\ 2 . It then follows from the strong convexity of fs that 

III®! - x*\\2 < fsix*) - fs{x*s) 

= fix*) + III** - a:°||2 - fs{x*5) 

< III** -*°||i, 

which implies that 

11*5 - 2:*l|2 < II** - *°||2- (4.24) 

Moreover, using the definition of fg and the fact that ** is feasible to (1133, we have 

<fs<f + §11* -* lb, 

which implies that 

/(*^) - /* < fs{x^) - ft + fs - f 

</5{*^)-/l + |ll**-*°lli- 

Now suppose that we run the RPDG method applied to (14.191) for K iterations. Then by Corollary [U we have 

^[fs{x^) — f] < a‘^^^(l — a) ^ ('^ + 2-I'5 + -x) ll*° “ *^112 

< - a)~^ (^5 + 2Ls + j [||*° - **||i + ||** - xl\\l] 

= - a)-^ (SS + 2L+ ^^4^) ll*° “ **ll2, 

where the last inequality follows from (14.241) and a is dehned in (13.261) with C = 8Lg/S = = 8(2/e + 1). 

Combining the above two relations, we have 

E[/(**^) - /*] < [20^/41 - «)”^ (35 + 2L+ ^^4^) + 1] [ll®° - ®*ll2- 

Dividing both sides of the above inequality by L(1 + ||*° — **||2)/2, we obtain 

E[Rac{x^,x°,f)] < f [20^/41-q)-i (^35 + 2 L +-^^ 4 ^) + I] 

< 4 + 2Y^2m(g + 1)^ (3e + 4 + (2 + e)(f + 1)) + §, 

which clearly implies the bound in (14.221) . The bound in (14.231) also follows from the above inequality and the 
Markov’s inequality. ■ 

By Proposition 4, the total number of gradient evaluations for the component functions $f_i$ required by the
RPDG method can be $O\{\sqrt{m}\log^{-1}(m/\epsilon)\}$ times smaller than those performed by deterministic optimal first-order
methods.

5 Complexity analysis

Our main goal in this section is to prove the main theorems in Sections 2 and 3. After introducing some basic
tools and general results about PDG and RPDG methods in Subsections 5.1 and 5.2, respectively, we provide the
proofs for Theorem 1 and Theorem 2, which describe the main convergence properties for the PDG and RPDG
methods, in Subsection 5.3. Moreover, in Subsection 5.4 we provide the proof for the lower complexity bound in
Theorem 3.


5.1 Some basic tools 

The following result provides a few different bounds on the diameter of the dual feasible sets $\mathcal{G}$ and $\mathcal{Y}$ in (2.7)
and (3.1).

Lemma 3 Let $x^0 \in X$ be given, $y_i^0 = \nabla f_i(x^0)$, $i = 1, \dots, m$, and $g^0 = \nabla f(x^0)$. Assume that $J_i'(y_i^0) = x^0$ and
$J_f'(g^0) = x^0$ in the definitions of $D(y^0, y)$ in (3.6) and of $D_f(g^0, g)$, respectively.

a) For any $x \in X$ and $y_i = \nabla f_i(x)$, $i = 1, \dots, m$, we have

$$D(y^0, y) \le \frac{L_f}{2}\|x^0 - x\|^2 \le L_f P(x^0, x). \tag{5.1}$$

b) If $x^* \in X$ is an optimal solution of (1.1) and $y_i^* = \nabla f_i(x^*)$, $i = 1, \dots, m$, then

$$D(y^0, y^*) \le \Psi(x^0) - \Psi(x^*). \tag{5.2}$$

c) For any $x \in X$ and $g = \nabla f(x)$, we have

$$D_f(g^0, g) \le \frac{L_f}{2}\|x^0 - x\|^2. \tag{5.3}$$
Proof. We first show part a). It follows from the definition of J,, drill . and (IrB that 

D{y°,y) = J{y)- J{y°) - - Vi) 

= {x,Uy) - f{x) + f{x°) - {x°,Uy°) - {x°,U{y - y°)) 

= f{x°) - f{x) - {Uy,x° - x) 

< ^ LfP{x°,x), 

where the last inequality follows from (1^ . We now show part b). By the above relation, the convexity of h and 
w, and the optimality of {x*,y*), we have 

D{y°,y*) = f{x°) - f{x*) - {Uy*,x° - x*) 

= f{x°) - f{x*) + {h'{x*) + pcj'{x*),x° - X*) - {Uy* + h'{x*) + tiuj'(x*),x° - x*) 

< f{x^) — f{x*) + {h'{x*) + jiuj'[x*),x^ — X*) < F[x^) — 'F{x*). 

The proof of part c) is similar to part a) and hence the details are skipped. ■ 

The following lemma gives an important bound for the primal optimality gap $\Psi(\bar{x}) - \Psi(x^*)$ for some $\bar{x} \in X$.

Lemma 4 Let $(\bar{x}, \bar{y}) \in Z$ be a given pair of feasible solutions of (3.1), and $z^* = (x^*, y^*)$ be a pair of optimal solutions
of (3.1). Then, we have

$$\Psi(\bar{x}) - \Psi(x^*) \le Q((\bar{x}, \bar{y}), z^*) + \frac{L_f}{2}\|\bar{x} - x^*\|^2. \tag{5.4}$$

Proof. Let $\bar{y}^* = (\nabla f_1(\bar{x}); \nabla f_2(\bar{x}); \dots; \nabla f_m(\bar{x}))$, and by the definition of $Q(\cdot, \cdot)$ in (3.3), we have

Q{{x,y),z*) = [h{x) + yui{x) + {x,Uy*) - J{y*)] - [h{x*) + yuj{x*) + {x*,Uy) - J{y)] 

> [h{x) + fj,uj{x) + {x,Uy*) - J{y*)] + {x,U{y* - y*)) - J{y*) + J{y*) 

- h{x*) + y,uj{x*) + uia.x{{x*,Uy) - J{y)] 
y&y 

= <I'{x) - <I'{x*) + {x,U{y* - y*)) - {x*,Uy*) + f{x*) + {x,Uy*) - f{x) 

= 'P{x) — 'P{x*) + f{x*) — f{x) + {x — X* , V f{x*)) > <I'{x) — -^11® “ 

where the second equality follows from the fact that Jj, i = 1,..., m, are the conjugate functions of /i. ■ 


5.2 General results for both PDG and RPDG 

We will establish some general convergence results in Proposition 5, which holds for both deterministic and
randomized PDG methods by viewing PDG as a special case of RPDG with $m = 1$. Then both Theorems 1 and
2 follow as immediate consequences of Proposition 5.

Before showing Proposition [5] we will develop a few technical results. Lemma [5] below characterizes the 
solutions of the prox-mapping in (ESI) and (1331) . This result generalizes some previous results (e.g., Lemma 6 of 
|2n| and Lemma 2 of M)- 

Lemma 5 Let $U$ be a closed convex set and a point $\tilde u \in U$ be given. Also let $w : U \to \mathbb{R}$ be a convex function and

$$W(\tilde u, u) = w(u) - w(\tilde u) - \langle w'(\tilde u), u - \tilde u\rangle, \tag{5.5}$$

for some $w'(\tilde u) \in \partial w(\tilde u)$. Assume that the function $q : U \to \mathbb{R}$ satisfies

$$q(u_1) - q(u_2) - \langle q'(u_2), u_1 - u_2\rangle \ge \mu_0 W(u_2, u_1), \quad \forall u_1, u_2 \in U \tag{5.6}$$

for some $\mu_0 \ge 0$. Also assume that the scalars $\mu_1$ and $\mu_2$ are chosen such that $\mu_0 + \mu_1 + \mu_2 \ge 0$. If

$$u^* \in \mathrm{Argmin}\{q(u) + \mu_1 w(u) + \mu_2 W(\tilde u, u) : u \in U\}, \tag{5.7}$$

then for any $u \in U$, we have

$$q(u^*) + \mu_1 w(u^*) + \mu_2 W(\tilde u, u^*) + (\mu_0 + \mu_1 + \mu_2) W(u^*, u) \le q(u) + \mu_1 w(u) + \mu_2 W(\tilde u, u).$$

Proof. Let $\phi(u) := q(u) + \mu_1 w(u) + \mu_2 W(\tilde u, u)$. It can be easily checked that for any $u_1, u_2 \in U$,

$$W(\tilde u, u_1) = W(\tilde u, u_2) + \langle W'(\tilde u, u_2), u_1 - u_2\rangle + W(u_2, u_1),$$
$$w(u_1) = w(u_2) + \langle w'(u_2), u_1 - u_2\rangle + W(u_2, u_1).$$

Using these relations and (5.6), we conclude that

$$\phi(u_1) - \phi(u_2) - \langle \phi'(u_2), u_1 - u_2\rangle \ge (\mu_0 + \mu_1 + \mu_2) W(u_2, u_1) \tag{5.8}$$

for any $u_1, u_2 \in U$, which together with the fact that $\mu_0 + \mu_1 + \mu_2 \ge 0$ then implies that $\phi$ is convex. Since $u^*$ is
an optimal solution of (5.7), we have $\langle \phi'(u^*), u - u^*\rangle \ge 0$. Combining this inequality with (5.8), we conclude that

$$\phi(u) - \phi(u^*) \ge (\mu_0 + \mu_1 + \mu_2) W(u^*, u),$$

from which the result immediately follows. ■
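To make the statement of Lemma 5 concrete, the following Python sketch evaluates its conclusion in the simplest setting we could think of. It assumes (these choices are ours, not the paper's) that $U = \mathbb{R}^n$, $w(u) = \tfrac12\|u\|_2^2$ so that $W(\tilde u, u) = \tfrac12\|u - \tilde u\|_2^2$, and $q(u) = \langle c, u\rangle$ with $\mu_0 = 0$. In that case the prox-mapping has a closed form and the inequality of the lemma holds with equality up to rounding.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def bregman_sq(u_from, u_to):
    """W(u_from, u_to) for w(u) = 0.5 * ||u||^2, i.e. 0.5 * ||u_to - u_from||^2."""
    return 0.5 * sum((t - s) ** 2 for s, t in zip(u_from, u_to))


def phi(c, mu1, mu2, u_tilde, u):
    """phi(u) = <c, u> + mu1 * w(u) + mu2 * W(u_tilde, u)."""
    return dot(c, u) + mu1 * 0.5 * dot(u, u) + mu2 * bregman_sq(u_tilde, u)


def prox_solution(c, mu1, mu2, u_tilde):
    """Unconstrained minimizer of phi; requires mu1 + mu2 > 0."""
    return [(mu2 * ut - ci) / (mu1 + mu2) for ci, ut in zip(c, u_tilde)]


def lemma5_slack(c, mu1, mu2, u_tilde, u):
    """phi(u) - phi(u*) - (mu0 + mu1 + mu2) * W(u*, u) with mu0 = 0.
    Lemma 5 states that this quantity is nonnegative."""
    u_star = prox_solution(c, mu1, mu2, u_tilde)
    gain = phi(c, mu1, mu2, u_tilde, u) - phi(c, mu1, mu2, u_tilde, u_star)
    return gain - (mu1 + mu2) * bregman_sq(u_star, u)


def worst_slack(trials=500, dim=4, seed=1):
    """Smallest slack over random instances (assumed test setup)."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        c = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        u_tilde = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        u = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        mu1, mu2 = rng.uniform(0.1, 2.0), rng.uniform(0.1, 2.0)
        worst = min(worst, lemma5_slack(c, mu1, mu2, u_tilde, u))
    return worst
```

With a non-quadratic $w$ or a constrained $U$ the slack would generally be strictly positive; the quadratic case is used here only because the minimizer is available in closed form.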

The following simple result provides a few identities related to $y^t$ and $\hat y^t$ that will be useful for the analysis
of the PDG algorithm.

Lemma 6 Let $y^t$, $\tilde y^t$, and $\hat y^t$ be defined in (3.8), (3.9), and (3.11), respectively. Then we have, for any $i = 1,\ldots,m$
and $t = 1,\ldots,k$,

$$\mathbb{E}_t[D_i(y_i^{t-1}, y_i^t)] = p_i D_i(y_i^{t-1}, \hat y_i^t), \tag{5.9}$$

$$\mathbb{E}_t[D_i(y_i^t, y_i)] = p_i D_i(\hat y_i^t, y_i) + (1 - p_i) D_i(y_i^{t-1}, y_i), \tag{5.10}$$

for any $y \in Y$, where $\mathbb{E}_t$ denotes the conditional expectation w.r.t. $i_t$ given $i_1, \ldots, i_{t-1}$.

Proof. (5.9) follows immediately from the facts that $\mathrm{Prob}_t\{y_i^t = \hat y_i^t\} = \mathrm{Prob}_t\{i_t = i\} = p_i$ and $\mathrm{Prob}_t\{y_i^t = y_i^{t-1}\} = 1 - p_i$. Here $\mathrm{Prob}_t$ denotes the conditional probability w.r.t. $i_t$ given $i_1, \ldots, i_{t-1}$. Similarly, we can show
(5.10). ■
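The two identities of Lemma 6 can be checked by direct enumeration over the random index $i_t$. The Python sketch below does this for a toy scalar model in which each dual block is a real number and $D_i(a, b) = \tfrac12(b - a)^2$; this choice of $D_i$, the helper names, and the data in the usage example are assumptions made purely for illustration.

```python
def breg(a, b):
    """Assumed Bregman distance D_i(a, b) = 0.5 * (b - a)^2 for scalar blocks."""
    return 0.5 * (b - a) ** 2


def expected_distances(i, p, y_prev, y_hat, y_ref):
    """Exact conditional expectations, over i_t drawn with probabilities p, of
    D_i(y_prev_i, y_i^t) and D_i(y_i^t, y_ref_i), where block i moves to
    y_hat[i] if i_t == i and stays at y_prev[i] otherwise."""
    e_prev, e_ref = 0.0, 0.0
    for i_t, prob in enumerate(p):
        y_new = y_hat[i] if i_t == i else y_prev[i]
        e_prev += prob * breg(y_prev[i], y_new)
        e_ref += prob * breg(y_new, y_ref[i])
    return e_prev, e_ref


def lemma6_errors(p, y_prev, y_hat, y_ref):
    """Largest absolute deviation from the identities (5.9) and (5.10)."""
    err = 0.0
    for i in range(len(p)):
        e_prev, e_ref = expected_distances(i, p, y_prev, y_hat, y_ref)
        rhs_59 = p[i] * breg(y_prev[i], y_hat[i])
        rhs_510 = p[i] * breg(y_hat[i], y_ref[i]) + (1 - p[i]) * breg(y_prev[i], y_ref[i])
        err = max(err, abs(e_prev - rhs_59), abs(e_ref - rhs_510))
    return err
```

For example, `lemma6_errors([0.2, 0.3, 0.5], [1.0, -2.0, 0.5], [0.0, 1.0, 3.0], [2.0, 2.0, -1.0])` should be zero up to rounding.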

We now prove an important recursion about the RPDG method. 

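Lemma 7 Let the gap function $Q$ be defined in (3.3). Also let $x^t$ and $\hat y^t$ be defined in (3.10) and (3.11), respectively.
Then for any $t \ge 1$, we have

$$\begin{aligned}
\mathbb{E}[Q((x^t, \hat y^t), z)] \le{}& \mathbb{E}\big[\eta_t P(x^{t-1}, x) - (\mu + \eta_t) P(x^t, x) - \eta_t P(x^{t-1}, x^t)\big] \\
& + \sum_{i=1}^m \mathbb{E}\big[(p_i^{-1}(1 + \tau_t) - 1) D_i(y_i^{t-1}, y_i) - p_i^{-1}(1 + \tau_t) D_i(y_i^t, y_i)\big] \\
& + \mathbb{E}\big[\langle \tilde x^t - x^t, U(\tilde y^t - y)\rangle - \tau_t p_{i_t}^{-1} D_{i_t}(y_{i_t}^{t-1}, y_{i_t}^t)\big], \quad \forall z \in Z.
\end{aligned} \tag{5.11}$$

Proof. It follows from Lemma 5 applied to (3.10) that, $\forall x \in X$,

$$\langle x^t - x, U\tilde y^t\rangle + h(x^t) + \mu\omega(x^t) - h(x) - \mu\omega(x) \le \eta_t P(x^{t-1}, x) - (\mu + \eta_t) P(x^t, x) - \eta_t P(x^{t-1}, x^t). \tag{5.12}$$

Moreover, by Lemma 5 applied to (3.11), we have, for any $i = 1, \ldots, m$ and $t = 1, \ldots, k$,

$$\langle -\tilde x^t, \hat y_i^t - y_i\rangle + J_i(\hat y_i^t) - J_i(y_i) \le \tau_t D_i(y_i^{t-1}, y_i) - (1 + \tau_t) D_i(\hat y_i^t, y_i) - \tau_t D_i(y_i^{t-1}, \hat y_i^t).$$

Summing up these inequalities over $i = 1, \ldots, m$, we have, $\forall y \in Y$,

$$\langle -\tilde x^t, U(\hat y^t - y)\rangle + J(\hat y^t) - J(y) \le \sum_{i=1}^m \big[\tau_t D_i(y_i^{t-1}, y_i) - (1 + \tau_t) D_i(\hat y_i^t, y_i) - \tau_t D_i(y_i^{t-1}, \hat y_i^t)\big]. \tag{5.13}$$

Using the definition of $Q$ in (3.3), (5.12), and (5.13), we have

$$\begin{aligned}
Q((x^t, \hat y^t), z) \le{}& \eta_t P(x^{t-1}, x) - (\mu + \eta_t) P(x^t, x) - \eta_t P(x^{t-1}, x^t) \\
& + \sum_{i=1}^m \big[\tau_t D_i(y_i^{t-1}, y_i) - (1 + \tau_t) D_i(\hat y_i^t, y_i) - \tau_t D_i(y_i^{t-1}, \hat y_i^t)\big] \\
& + \langle x, U(\tilde y^t - \hat y^t)\rangle - \langle x^t, U(\tilde y^t - y)\rangle + \langle \tilde x^t, U(\hat y^t - y)\rangle.
\end{aligned} \tag{5.14}$$

Also observe that by (3.8), (3.12), (5.9), and (5.10),

$$D_i(y_i^{t-1}, y_i^t) = 0, \quad \forall i \ne i_t,$$

$$\mathbb{E}[\langle x, U(\tilde y^t - \hat y^t)\rangle] = 0,$$

$$\mathbb{E}[\langle \tilde x^t, U\hat y^t\rangle] = \mathbb{E}[\langle \tilde x^t, U\tilde y^t\rangle],$$

$$\mathbb{E}[D_i(y_i^{t-1}, \hat y_i^t)] = \mathbb{E}[p_i^{-1} D_i(y_i^{t-1}, y_i^t)],$$

$$\mathbb{E}[D_i(\hat y_i^t, y_i)] = p_i^{-1}\mathbb{E}[D_i(y_i^t, y_i)] - (p_i^{-1} - 1)\mathbb{E}[D_i(y_i^{t-1}, y_i)].$$

Taking expectation on both sides of (5.14) and using the above observations, we obtain (5.11). ■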


We are now ready to establish a general convergence result which holds for both PDG and RPDG. 
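Proposition 5 Suppose that $\{\tau_t\}$, $\{\eta_t\}$, and $\{\alpha_t\}$ in the RPDG method satisfy

$$\theta_t\big(p_i^{-1}(1 + \tau_t) - 1\big) \le p_i^{-1}\theta_{t-1}(1 + \tau_{t-1}), \quad i = 1, \ldots, m;\ t = 2, \ldots, k, \tag{5.15}$$

$$\theta_t\eta_t \le \theta_{t-1}(\mu + \eta_{t-1}), \quad t = 2, \ldots, k, \tag{5.16}$$

$$\frac{\eta_k}{4} \ge \frac{L_i(1 - p_i)^2}{\tau_k p_i}, \quad i = 1, \ldots, m, \tag{5.17}$$

$$\frac{\eta_{t-1}}{2} \ge \frac{L_i\alpha_t}{\tau_t p_i} + \frac{(1 - p_j)^2 L_j}{\tau_{t-1} p_j}, \quad i, j \in \{1, \ldots, m\};\ t = 2, \ldots, k, \tag{5.18}$$

$$\frac{\eta_k}{2} \ge \frac{\sum_{i=1}^m p_i L_i}{1 + \tau_k}, \tag{5.19}$$

$$\alpha_t\theta_t = \theta_{t-1}, \quad t = 2, \ldots, k, \tag{5.20}$$

for some $\theta_t > 0$, $t = 1, \ldots, k$. Then, for any $k \ge 1$ and any given $z \in Z$, we have

$$\mathbb{E}\Big[\sum_{t=1}^k \theta_t Q((x^t, \hat y^t), z)\Big] \le \eta_1\theta_1 P(x^0, x) - (\mu + \eta_k)\theta_k \mathbb{E}[P(x^k, x)] + \sum_{i=1}^m \theta_1\big(p_i^{-1}(1 + \tau_1) - 1\big) D_i(y_i^0, y_i). \tag{5.21}$$

Proof. Multiplying both sides of (5.11) by $\theta_t$ and summing the resulting inequalities, we have

$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^k \theta_t Q((x^t, \hat y^t), z)\Big] \le{}& \mathbb{E}\Big[\sum_{t=1}^k \theta_t\big(\eta_t P(x^{t-1}, x) - (\mu + \eta_t) P(x^t, x) - \eta_t P(x^{t-1}, x^t)\big)\Big] \\
& + \sum_{i=1}^m \mathbb{E}\Big\{\sum_{t=1}^k \theta_t\big[(p_i^{-1}(1 + \tau_t) - 1) D_i(y_i^{t-1}, y_i) - p_i^{-1}(1 + \tau_t) D_i(y_i^t, y_i)\big]\Big\} \\
& + \mathbb{E}\Big[\sum_{t=1}^k \theta_t\big(\langle \tilde x^t - x^t, U(\tilde y^t - y)\rangle - \tau_t p_{i_t}^{-1} D_{i_t}(y_{i_t}^{t-1}, y_{i_t}^t)\big)\Big],
\end{aligned}$$

which, in view of the assumptions in (5.16) and (5.15), then implies that

$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^k \theta_t Q((x^t, \hat y^t), z)\Big] \le{}& \eta_1\theta_1 P(x^0, x) - (\mu + \eta_k)\theta_k \mathbb{E}[P(x^k, x)] \\
& + \sum_{i=1}^m \mathbb{E}\big[\theta_1(p_i^{-1}(1 + \tau_1) - 1) D_i(y_i^0, y_i) - p_i^{-1}\theta_k(1 + \tau_k) D_i(y_i^k, y_i)\big] \\
& - \mathbb{E}\Big[\sum_{t=1}^k \theta_t\Delta_t\Big],
\end{aligned} \tag{5.22}$$

where

$$\Delta_t := \eta_t P(x^{t-1}, x^t) - \langle \tilde x^t - x^t, U(\tilde y^t - y)\rangle + \tau_t p_{i_t}^{-1} D_{i_t}(y_{i_t}^{t-1}, y_{i_t}^t). \tag{5.23}$$

We now provide a bound on $\sum_{t=1}^k \theta_t\Delta_t$ in (5.22). Note that by (3.7), we have

$$\begin{aligned}
\langle \tilde x^t - x^t, U(\tilde y^t - y)\rangle ={}& \langle x^{t-1} - x^t, U(\tilde y^t - y)\rangle - \alpha_t\langle x^{t-2} - x^{t-1}, U(\tilde y^t - y)\rangle \\
={}& \langle x^{t-1} - x^t, U(\tilde y^t - y)\rangle - \alpha_t\langle x^{t-2} - x^{t-1}, U(\tilde y^{t-1} - y)\rangle \\
& - \alpha_t\langle x^{t-2} - x^{t-1}, U(\tilde y^t - \tilde y^{t-1})\rangle \\
={}& \langle x^{t-1} - x^t, U(\tilde y^t - y)\rangle - \alpha_t\langle x^{t-2} - x^{t-1}, U(\tilde y^{t-1} - y)\rangle \\
& - \alpha_t p_{i_t}^{-1}\langle x^{t-2} - x^{t-1}, y_{i_t}^t - y_{i_t}^{t-1}\rangle \\
& - \alpha_t(p_{i_{t-1}}^{-1} - 1)\langle x^{t-2} - x^{t-1}, y_{i_{t-1}}^{t-2} - y_{i_{t-1}}^{t-1}\rangle,
\end{aligned} \tag{5.24}$$

where the last identity follows from the observation that by (3.8) and (3.9),

$$\begin{aligned}
U(\tilde y^t - \tilde y^{t-1}) ={}& \sum_{i=1}^m \big\{[p_i^{-1}(y_i^t - y_i^{t-1}) + y_i^{t-1}] - [p_i^{-1}(y_i^{t-1} - y_i^{t-2}) + y_i^{t-2}]\big\} \\
={}& \sum_{i=1}^m \big\{[p_i^{-1}y_i^t - (p_i^{-1} - 1)y_i^{t-1}] - [p_i^{-1}y_i^{t-1} - (p_i^{-1} - 1)y_i^{t-2}]\big\} \\
={}& \sum_{i=1}^m \big[p_i^{-1}(y_i^t - y_i^{t-1}) + (p_i^{-1} - 1)(y_i^{t-2} - y_i^{t-1})\big] \\
={}& p_{i_t}^{-1}(y_{i_t}^t - y_{i_t}^{t-1}) + (p_{i_{t-1}}^{-1} - 1)(y_{i_{t-1}}^{t-2} - y_{i_{t-1}}^{t-1}).
\end{aligned}$$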
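Using relation (5.24) in the definition of $\Delta_t$ in (5.23), we have

$$\begin{aligned}
\sum_{t=1}^k \theta_t\Delta_t = \sum_{t=1}^k \theta_t\Big[& \eta_t P(x^{t-1}, x^t) - \langle x^{t-1} - x^t, U(\tilde y^t - y)\rangle + \alpha_t\langle x^{t-2} - x^{t-1}, U(\tilde y^{t-1} - y)\rangle \\
& + \alpha_t p_{i_t}^{-1}\langle x^{t-2} - x^{t-1}, y_{i_t}^t - y_{i_t}^{t-1}\rangle + \alpha_t(p_{i_{t-1}}^{-1} - 1)\langle x^{t-2} - x^{t-1}, y_{i_{t-1}}^{t-2} - y_{i_{t-1}}^{t-1}\rangle \\
& + p_{i_t}^{-1}\tau_t D_{i_t}(y_{i_t}^{t-1}, y_{i_t}^t)\Big].
\end{aligned} \tag{5.25}$$

Observe that by (5.20) and the fact that $x^{-1} = x^0$,

$$\begin{aligned}
\sum_{t=1}^k \theta_t\big[\langle x^{t-1} - x^t, U(\tilde y^t - y)\rangle - \alpha_t\langle x^{t-2} - x^{t-1}, U(\tilde y^{t-1} - y)\rangle\big] ={}& \theta_k\langle x^{k-1} - x^k, U(\tilde y^k - y)\rangle \\
={}& \theta_k\langle x^{k-1} - x^k, U(y^k - y)\rangle + \theta_k\langle x^{k-1} - x^k, U(\tilde y^k - y^k)\rangle \\
={}& \theta_k\langle x^{k-1} - x^k, U(y^k - y)\rangle + \theta_k(p_{i_k}^{-1} - 1)\langle x^{k-1} - x^k, y_{i_k}^k - y_{i_k}^{k-1}\rangle,
\end{aligned}$$

where the last identity follows from the definitions of $y^k$ and $\tilde y^k$ in (3.8) and (3.9), respectively. Also, by the
strong convexity of $P$ and $D_i$, we have

$$P(x^{t-1}, x^t) \ge \tfrac12\|x^{t-1} - x^t\|^2 \quad \text{and} \quad D_{i_t}(y_{i_t}^{t-1}, y_{i_t}^t) \ge \tfrac{1}{2L_{i_t}}\|y_{i_t}^{t-1} - y_{i_t}^t\|^2.$$

Using the previous three relations in (5.25), we have

$$\begin{aligned}
\sum_{t=1}^k \theta_t\Delta_t \ge{}& \sum_{t=1}^k \theta_t\Big[\tfrac{\eta_t}{2}\|x^{t-1} - x^t\|^2 + \alpha_t p_{i_t}^{-1}\langle x^{t-2} - x^{t-1}, y_{i_t}^t - y_{i_t}^{t-1}\rangle \\
& \qquad + \alpha_t(p_{i_{t-1}}^{-1} - 1)\langle x^{t-2} - x^{t-1}, y_{i_{t-1}}^{t-2} - y_{i_{t-1}}^{t-1}\rangle + \tfrac{\tau_t}{2L_{i_t}p_{i_t}}\|y_{i_t}^{t-1} - y_{i_t}^t\|^2\Big] \\
& - \theta_k\langle x^{k-1} - x^k, U(y^k - y)\rangle - \theta_k(p_{i_k}^{-1} - 1)\langle x^{k-1} - x^k, y_{i_k}^k - y_{i_k}^{k-1}\rangle.
\end{aligned}$$

Regrouping the terms in the above relation, and using the fact that $x^{-1} = x^0$, we obtain

$$\begin{aligned}
\sum_{t=1}^k \theta_t\Delta_t \ge{}& \theta_k\Big[\tfrac{\eta_k}{4}\|x^{k-1} - x^k\|^2 - \langle x^{k-1} - x^k, U(y^k - y)\rangle\Big] \\
& + \theta_k\Big[\tfrac{\eta_k}{4}\|x^{k-1} - x^k\|^2 - (p_{i_k}^{-1} - 1)\langle x^{k-1} - x^k, y_{i_k}^k - y_{i_k}^{k-1}\rangle + \tfrac{\tau_k}{4L_{i_k}p_{i_k}}\|y_{i_k}^k - y_{i_k}^{k-1}\|^2\Big] \\
& + \sum_{t=2}^k \theta_t\Big[\alpha_t p_{i_t}^{-1}\langle x^{t-2} - x^{t-1}, y_{i_t}^t - y_{i_t}^{t-1}\rangle + \tfrac{\tau_t}{4L_{i_t}p_{i_t}}\|y_{i_t}^t - y_{i_t}^{t-1}\|^2\Big] \\
& + \sum_{t=2}^k \Big[\alpha_t\theta_t(p_{i_{t-1}}^{-1} - 1)\langle x^{t-2} - x^{t-1}, y_{i_{t-1}}^{t-2} - y_{i_{t-1}}^{t-1}\rangle + \tfrac{\tau_{t-1}\theta_{t-1}}{4L_{i_{t-1}}p_{i_{t-1}}}\|y_{i_{t-1}}^{t-2} - y_{i_{t-1}}^{t-1}\|^2\Big] \\
& + \sum_{t=2}^k \tfrac{\theta_{t-1}\eta_{t-1}}{2}\|x^{t-2} - x^{t-1}\|^2 \\
\ge{}& \theta_k\Big[\tfrac{\eta_k}{4}\|x^{k-1} - x^k\|^2 - \langle x^{k-1} - x^k, U(y^k - y)\rangle\Big] + \theta_k\Big(\tfrac{\eta_k}{4} - \tfrac{L_{i_k}(1 - p_{i_k})^2}{\tau_k p_{i_k}}\Big)\|x^{k-1} - x^k\|^2 \\
& + \sum_{t=2}^k \Big(\tfrac{\theta_{t-1}\eta_{t-1}}{2} - \tfrac{L_{i_t}\alpha_t^2\theta_t}{\tau_t p_{i_t}} - \tfrac{\alpha_t^2\theta_t^2(1 - p_{i_{t-1}})^2 L_{i_{t-1}}}{\tau_{t-1}\theta_{t-1}p_{i_{t-1}}}\Big)\|x^{t-2} - x^{t-1}\|^2 \\
={}& \theta_k\Big[\tfrac{\eta_k}{4}\|x^{k-1} - x^k\|^2 - \langle x^{k-1} - x^k, U(y^k - y)\rangle\Big] + \theta_k\Big(\tfrac{\eta_k}{4} - \tfrac{L_{i_k}(1 - p_{i_k})^2}{\tau_k p_{i_k}}\Big)\|x^{k-1} - x^k\|^2 \\
& + \sum_{t=2}^k \theta_{t-1}\Big(\tfrac{\eta_{t-1}}{2} - \tfrac{L_{i_t}\alpha_t}{\tau_t p_{i_t}} - \tfrac{(1 - p_{i_{t-1}})^2 L_{i_{t-1}}}{\tau_{t-1}p_{i_{t-1}}}\Big)\|x^{t-2} - x^{t-1}\|^2 \\
\ge{}& \theta_k\Big[\tfrac{\eta_k}{4}\|x^{k-1} - x^k\|^2 - \langle x^{k-1} - x^k, U(y^k - y)\rangle\Big],
\end{aligned} \tag{5.26}$$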
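where the second inequality follows from the simple relation that

$$b\langle u, v\rangle + a\|v\|^2/2 \ge -b^2\|u\|^2/(2a), \quad \forall a > 0, \tag{5.27}$$

and the last inequality follows from (5.17) and (5.18). Plugging the bound (5.26) into (5.22), we have

$$\begin{aligned}
\mathbb{E}\Big[\sum_{t=1}^k \theta_t Q((x^t, \hat y^t), z)\Big] \le{}& \theta_1\eta_1 P(x^0, x) - \theta_k(\mu + \eta_k)\mathbb{E}[P(x^k, x)] + \sum_{i=1}^m \theta_1\big(p_i^{-1}(1 + \tau_1) - 1\big) D_i(y_i^0, y_i) \\
& - \theta_k\mathbb{E}\Big[\tfrac{\eta_k}{4}\|x^{k-1} - x^k\|^2 - \langle x^{k-1} - x^k, U(y^k - y)\rangle + \sum_{i=1}^m p_i^{-1}(1 + \tau_k) D_i(y_i^k, y_i)\Big].
\end{aligned}$$

Also observe that by (5.19) and (5.27),

$$\begin{aligned}
& \tfrac{\eta_k}{4}\|x^{k-1} - x^k\|^2 - \langle x^{k-1} - x^k, U(y^k - y)\rangle + \sum_{i=1}^m p_i^{-1}(1 + \tau_k) D_i(y_i^k, y_i) \\
& \quad \ge \tfrac{\eta_k}{4}\|x^{k-1} - x^k\|^2 + \sum_{i=1}^m \Big[-\langle x^{k-1} - x^k, y_i^k - y_i\rangle + \tfrac{1 + \tau_k}{2L_i p_i}\|y_i^k - y_i\|^2\Big] \\
& \quad \ge 0.
\end{aligned}$$

The result then immediately follows by combining the above two conclusions. ■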
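The elementary relation (5.27) is used several times in the proof above, so it may help to see it checked numerically. The Python sketch below assumes the Euclidean norm and random test vectors (both are our assumptions; the paper allows general norms) and returns the smallest observed slack, which equals $\tfrac{a}{2}\|v + (b/a)u\|^2$ and should therefore be nonnegative.

```python
import random


def relation_slack(a, b, u, v):
    """Left side minus right side of
    b * <u, v> + a * ||v||^2 / 2 >= -b^2 * ||u||^2 / (2 * a), for a > 0."""
    inner = sum(ui * vi for ui, vi in zip(u, v))
    norm_u_sq = sum(ui * ui for ui in u)
    norm_v_sq = sum(vi * vi for vi in v)
    return b * inner + 0.5 * a * norm_v_sq + b * b * norm_u_sq / (2.0 * a)


def min_relation_slack(trials=2000, dim=3, seed=2):
    """Smallest slack over random instances (assumed test setup)."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        a = rng.uniform(0.1, 5.0)
        b = rng.uniform(-5.0, 5.0)
        u = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        v = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
        worst = min(worst, relation_slack(a, b, u, v))
    return worst
```

Equality is attained at $v = -(b/a)u$, which is the choice that makes the bounds (5.17) to (5.19) tight in the argument above.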

5.3 Proof of main convergence results 

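We now provide a proof for Theorem 1, which describes the main convergence properties of the deterministic
PDG method.

We first specialize Proposition 5 for the PDG method applied to (2.7).

Proposition 6 Suppose that $\{\tau_t\}$, $\{\eta_t\}$, and $\{\alpha_t\}$ in the PDG method satisfy

$$\theta_t\tau_t \le \theta_{t-1}(1 + \tau_{t-1}), \quad t = 2, \ldots, k, \tag{5.28}$$

$$\theta_t\eta_t \le \theta_{t-1}(\mu + \eta_{t-1}), \quad t = 2, \ldots, k, \tag{5.29}$$

$$\eta_{t-1}\tau_t \ge 2L_f\alpha_t, \quad t = 2, \ldots, k, \tag{5.30}$$

$$\eta_k(1 + \tau_k) \ge 2L_f, \tag{5.31}$$

$$\alpha_t = \theta_{t-1}/\theta_t, \quad t = 2, \ldots, k, \tag{5.32}$$

for some $\theta_t > 0$, $t = 1, \ldots, k$. Also let us denote $z^t = (x^t, g^t)$, and

$$\bar z^k := \Big(\sum_{t=1}^k \theta_t\Big)^{-1}\sum_{t=1}^k \theta_t z^t. \tag{5.33}$$

Then, for any $k \ge 1$ and any given $z = (x, g) \in X \times G$, we have

$$\Big(\sum_{t=1}^k \theta_t\Big) Q_f(\bar z^k, z) + \theta_k(\mu + \eta_k) P(x^k, x) \le \theta_1\eta_1 P(x^0, x) + \theta_1\tau_1 D_f(g^0, g). \tag{5.34}$$

Proof. Notice that in the deterministic PDG method, we have $m = 1$, $p_1 = 1$, and $y^t = g^t$. It can be easily
seen that the assumptions in (5.15)-(5.20) are implied by those in (5.28)-(5.32). It then follows from (5.21) that

$$\sum_{t=1}^k \theta_t Q_f(z^t, z) \le \theta_1\eta_1 P(x^0, x) - \theta_k(\mu + \eta_k) P(x^k, x) + \theta_1\tau_1 D_f(g^0, g).$$

Dividing both sides of the above inequality by $\sum_{t=1}^k \theta_t$ and using the convexity of $Q_f(\tilde z, z)$ w.r.t. $\tilde z$, we have

$$\Big(\sum_{t=1}^k \theta_t\Big) Q_f(\bar z^k, z) \le \sum_{t=1}^k \theta_t Q_f(z^t, z) \le \theta_1\eta_1 P(x^0, x) - \theta_k(\mu + \eta_k) P(x^k, x) + \theta_1\tau_1 D_f(g^0, g).$$

Rearranging the terms in the above relation, we obtain (5.34). ■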
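Conditions of the type (5.28)-(5.32) are easy to verify mechanically for a candidate stepsize policy. The Python sketch below is one way to do so; it reads (5.30) and (5.31) with the constant $2L_f$, and the schedule $\tau_t = (t-1)/2$, $\eta_t = 4L_f/t$, $\theta_t = t$, $\alpha_t = (t-1)/t$ with $\mu = 0$ is an assumed example chosen to be consistent with the facts $\tau_1 = 0$, $\theta_1\eta_1 = 4L_f$ and $\sum_t \theta_t = k(k+1)/2$ used later for part b) of Theorem 1. The function names are hypothetical.

```python
def check_pdg_conditions(tau, eta, theta, alpha, L_f, mu, k, tol=1e-9):
    """Check (5.28)-(5.32) for t = 2..k; tau, eta, theta, alpha map t -> value."""
    for t in range(2, k + 1):
        if theta(t) * tau(t) > theta(t - 1) * (1.0 + tau(t - 1)) + tol:   # (5.28)
            return False
        if theta(t) * eta(t) > theta(t - 1) * (mu + eta(t - 1)) + tol:    # (5.29)
            return False
        if eta(t - 1) * tau(t) < 2.0 * L_f * alpha(t) - tol:              # (5.30)
            return False
        if abs(alpha(t) - theta(t - 1) / theta(t)) > tol:                 # (5.32)
            return False
    return eta(k) * (1.0 + tau(k)) >= 2.0 * L_f - tol                     # (5.31)


def smooth_schedule(L_f):
    """Assumed schedule for mu = 0: returns (tau, eta, theta, alpha) as callables."""
    return (
        lambda t: (t - 1) / 2.0,
        lambda t: 4.0 * L_f / t,
        lambda t: float(t),
        lambda t: (t - 1) / float(t),
    )
```

For instance, `check_pdg_conditions(*smooth_schedule(3.0), L_f=3.0, mu=0.0, k=50)` should return `True`, while halving every $\eta_t$ violates (5.30) from $t = 3$ onward.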

We are now ready to show Theorem 1.

Proof of Theorem 1. We first show part a). It can be easily checked that (5.28)-(5.32) are satisfied with the
selection of $\{\tau_t\}$, $\{\eta_t\}$, $\{\alpha_t\}$, and $\{\theta_t\}$ in (2.24). Using (5.34) (with $x = x^*$ and $g = g^*$), (5.3), and the fact that
$Q_f(\bar z^k, z^*) \ge 0$, we have

$$\theta_k(\mu + \eta_k) P(x^k, x^*) \le \theta_1(\eta_1 + L_f\tau_1) P(x^0, x^*), \quad \forall k \ge 1.$$

Using the parameter settings in (2.24), we conclude that

$$P(x^k, x^*) \le \frac{\theta_1(\eta_1 + L_f\tau_1)}{\theta_k(\mu + \eta_k)} P(x^0, x^*).$$




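Also using (5.34) and the fact that $P(x^k, x) \ge 0$, we have

$$\Big(\sum_{t=1}^k \theta_t\Big) Q_f(\bar z^k, z) \le \theta_1\eta_1 P(x^0, x) + \theta_1\tau_1 D_f(g^0, g), \quad \forall z \in Z. \tag{5.35}$$

Denoting $\hat g^k := (\nabla f_1(\bar x^k); \ldots; \nabla f_m(\bar x^k))$, we conclude from (5.3) that

$$\begin{aligned}
D_f(g^0, \hat g^k) \le{}& \tfrac{L_f}{2}\|x^0 - \bar x^k\|^2 \le \tfrac{L_f}{2}\Big(\sum_{t=1}^k \theta_t\Big)^{-1}\sum_{t=1}^k \theta_t\|x^0 - x^t\|^2 \\
\le{}& L_f\Big(\sum_{t=1}^k \theta_t\Big)^{-1}\sum_{t=1}^k \theta_t\big(\|x^t - x^*\|^2 + \|x^0 - x^*\|^2\big) \\
\le{}& L_f\Big[2\Big(1 + \tfrac{L_f}{\mu}\Big) P(x^0, x^*) + \|x^0 - x^*\|^2\Big] \le 2L_f\Big(2 + \tfrac{L_f}{\mu}\Big) P(x^0, x^*),
\end{aligned}$$

where the second inequality follows from the convexity of $\|\cdot\|^2$, the third inequality follows from the triangular
inequality, the fourth inequality follows from $\|x^t - x^*\|^2 \le 2P(x^t, x^*)$ and (2.25), and the last inequality follows
from $\|x^0 - x^*\|^2 \le 2P(x^0, x^*)$. Also note that by the definition of $\theta_t$, we have

$$\sum_{t=1}^k \theta_t = \sum_{t=1}^k \alpha^{-t} = \frac{1 - \alpha^k}{(1 - \alpha)\alpha^k} \ge \frac{1}{\alpha^k}, \tag{5.36}$$

where the last inequality follows from the fact that $\alpha < 1$ due to (2.24). Fixing $g = \hat g^k$ in (5.35) and using the
above two relations, we obtain

$$\begin{aligned}
Q_f(\bar z^k, (x, \hat g^k)) \le{}& \alpha^k\Big[\theta_1\eta_1 P(x^0, x) + 2L_f\theta_1\tau_1\Big(2 + \tfrac{L_f}{\mu}\Big) P(x^0, x^*)\Big] \\
={}& \alpha^{k-1}\Big[\sqrt{2L_f\mu}\, P(x^0, x) + 2L_f\sqrt{\tfrac{2L_f}{\mu}}\Big(2 + \tfrac{L_f}{\mu}\Big) P(x^0, x^*)\Big].
\end{aligned}$$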

The result in (12.2611 then directly follows from the above relation and (12.2111 . If A is bounded, the result in (12.2711 
then follows from the above relation, (ESDI, and (|^^^2j} . 

We now show part b). It is trivial to check that the conditions in (5.28)-(5.32) hold by using our selection of
$\{\tau_t\}$, $\{\eta_t\}$, $\{\alpha_t\}$, and $\{\theta_t\}$. Using (5.34) and the facts $\tau_1 = 0$ and $P(x^k, x) \ge 0$, we have

$$\Big(\sum_{t=1}^k \theta_t\Big) Q_f(\bar z^k, z) \le \theta_1\eta_1 P(x^0, x) = 4L_f P(x^0, x),$$

which, in view of (2.20) and (2.21) and the fact that $\sum_{t=1}^k \theta_t = k(k + 1)/2$, clearly implies (2.29). In case $X$ is
bounded, the result in (2.30) immediately follows from (2.21), (2.22), and the above inequality. ■
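The geometric-sum relation (5.36), which converts the weighted bound into a linear rate for $\theta_t = \alpha^{-t}$, can be confirmed with a few lines of Python. The sketch below is only an illustration; the helper names and the sample values of $\alpha$ and $k$ are assumptions.

```python
import math


def theta_sum(alpha, k):
    """Direct evaluation of sum_{t=1}^k alpha^(-t)."""
    return sum(alpha ** (-t) for t in range(1, k + 1))


def theta_sum_closed_form(alpha, k):
    """Closed form (1 - alpha^k) / ((1 - alpha) * alpha^k), valid for 0 < alpha < 1."""
    return (1.0 - alpha ** k) / ((1.0 - alpha) * alpha ** k)


def relation_536_holds(alpha, k):
    """True if the identity and the lower bound 1 / alpha^k in (5.36) both hold."""
    s = theta_sum(alpha, k)
    identity = math.isclose(s, theta_sum_closed_form(alpha, k), rel_tol=1e-9)
    lower_bound = s >= (1.0 - 1e-12) / alpha ** k
    return identity and lower_bound
```

The lower bound is tight only for $k = 1$; for larger $k$ the sum exceeds $\alpha^{-k}$ by roughly a factor $1/(1 - \alpha)$.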

We are now ready to provide a proof for Theorem 2, which describes the main convergence properties of the
RPDG method applied to strongly convex problems with $\mu > 0$.


Proof of Theorem 2. It can be easily checked that the conditions in (5.15)-(5.20) are satisfied with our requirements
(3.19)-(3.22) of $\{\tau_t\}$, $\{\eta_t\}$, $\{\alpha_t\}$, and $\{\theta_t\}$. Using the fact that $Q((x^t, \hat y^t), z^*) \ge 0$, we then conclude from
(5.21) (with $x = x^*$ and $y = y^*$) that, for any $k \ge 1$,


IE[P(*^**)] < em [^^ 9 P{x°,x*) + ^{^D{y°,y*)] < (l + a^P{x\x*), 


where the first inequality follows from (ITT91) and (13.2011 , and the second inequality follows from (13.2111 and dSID. 
Let us denote In view of (15.41) . the convexity of || ■ ||, and (|2.2I) .

we have 


< E[Q(^-^ z*)] + ] 

< E[Q(S^ z*)] +Lf{Y!:=iOt)-^nT!l=ietP{x\ X*)]. 

Using (15.211) (with x = x* and y = y*), the fact that P{x^, x) > 0, and (I5.36L we obtain 

E[Q(z",z*)] < Eti^‘E[Q((rr‘,y*),z*)] < a>^ (a'S + P{x°,x*). 

We conclude from (13.231) and the definition of {6t} that 

= IT^(l + + ^,)P{x°,x*). 

Using the above two relations, and (15.371) . we obtain 

E[^(x'=) - ^(x*)] < 0 ^= (a-S + P{x°,x*) + Lf2a^^^ (l + P(x°,x*) 


(5.37) 




0 *\ 


5.4 Proof of the lower complexity bound 

This subsection is devoted to the proof of Theorem 3, which describes the performance limit for randomized
incremental gradient methods.

The following result provides an explicit expression for the optimal solution of (3.37).

Lemma 8 Let q be defined in {3.^^ , x* j is the j-th element of x^, and define 

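$$x^*_{i,j} = q^j, \quad i = 1, \ldots, m;\ j = 1, \ldots, \tilde n. \tag{5.38}$$

Then $x^*$ is the unique optimal solution of (3.37).

Proof. It can be easily seen that $q$ is the smallest root of the equation

$$q^2 - 2\,\tfrac{\mathcal{Q} + 1}{\mathcal{Q} - 1}\,q + 1 = 0. \tag{5.39}$$

Note that $x^*$ satisfies the optimality condition of (3.37), i.e.,

$$\Big(A + \tfrac{4}{\mathcal{Q} - 1} I\Big) x_i^* = e_1, \quad i = 1, \ldots, m. \tag{5.40}$$

Indeed, we can write the coordinate form of (5.40) as

$$2\,\tfrac{\mathcal{Q} + 1}{\mathcal{Q} - 1}\,x^*_{i,1} - x^*_{i,2} = 1, \tag{5.41}$$

$$x^*_{i,j+1} - 2\,\tfrac{\mathcal{Q} + 1}{\mathcal{Q} - 1}\,x^*_{i,j} + x^*_{i,j-1} = 0, \quad j = 2, 3, \ldots, \tilde n - 1, \tag{5.42}$$

$$-\Big(\kappa + \tfrac{4}{\mathcal{Q} - 1}\Big) x^*_{i,\tilde n} + x^*_{i,\tilde n - 1} = 0, \tag{5.43}$$

where the first two equations follow directly from the definition of $x^*$ and relation (5.39), and the last equation
is implied by the definitions of $\kappa$ and $x^*$ in (3.39) and (5.38), respectively. ■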
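The coordinate equations (5.41)-(5.43) can be checked numerically at the candidate solution $x^*_{i,j} = q^j$. The Python sketch below does so under explicitly assumed values that are not shown in this section: $q = (\sqrt{\mathcal{Q}} - 1)/(\sqrt{\mathcal{Q}} + 1)$ and $\kappa = (\sqrt{\mathcal{Q}} + 3)/(\sqrt{\mathcal{Q}} + 1)$, which are the choices that make all three residuals vanish. The function name and the test values are hypothetical.

```python
import math


def lemma8_residual(Q, n_tilde):
    """Largest absolute residual of (5.41)-(5.43) at x_j = q^j (n_tilde >= 2).

    Assumptions (not taken from this section): q = (sqrt(Q) - 1) / (sqrt(Q) + 1)
    and kappa = (sqrt(Q) + 3) / (sqrt(Q) + 1), with Q > 1."""
    s = math.sqrt(Q)
    q = (s - 1.0) / (s + 1.0)
    c = (Q + 1.0) / (Q - 1.0)
    kappa = (s + 3.0) / (s + 1.0)
    x = [q ** j for j in range(n_tilde + 1)]          # x[j] = q^j; x[0] is padding
    residuals = [2.0 * c * x[1] - x[2] - 1.0]         # (5.41)
    for j in range(2, n_tilde):                       # (5.42)
        residuals.append(x[j + 1] - 2.0 * c * x[j] + x[j - 1])
    residuals.append(-(kappa + 4.0 / (Q - 1.0)) * x[n_tilde] + x[n_tilde - 1])  # (5.43)
    return max(abs(r) for r in residuals)
```

The first two families of equations reduce to $q^2 - 2\frac{\mathcal{Q}+1}{\mathcal{Q}-1}q + 1 = 0$, and the last one to $\kappa + \frac{4}{\mathcal{Q}-1} = 1/q$, which is why the residual is zero up to rounding.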

We also need a few technical results to establish the lower complexity bounds. 
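Lemma 9 a) For any $x > 1$, we have

$$\log\Big(1 - \frac{1}{x}\Big) \ge -\frac{1}{x - 1}. \tag{5.44}$$

b) Let $\rho, q, \bar q \in (0, 1)$ be given. If we have

$$\tilde n \ge \frac{t\log\bar q + \log(1 - \rho)}{2\log q}$$

for any $t \ge 0$, then

$$\bar q^{\,t} - q^{2\tilde n} \ge \rho\,\bar q^{\,t}(1 - q^{2\tilde n}).$$

Proof. We first show part a). Denote $\phi(x) = \log(1 - \frac{1}{x}) + \frac{1}{x - 1}$. It can be easily seen that $\lim_{x \to +\infty}\phi(x) = 0$.
Moreover, for any $x > 1$, we have

$$\phi'(x) = \frac{1}{x(x - 1)} - \frac{1}{(x - 1)^2} = \frac{1}{x - 1}\Big(\frac{1}{x} - \frac{1}{x - 1}\Big) < 0,$$

which implies that $\phi$ is a strictly decreasing function for $x > 1$. Hence, we must have $\phi(x) > 0$ for any $x > 1$. Part
b) follows from the following simple calculation.

$$\bar q^{\,t} - q^{2\tilde n} - \rho\,\bar q^{\,t}(1 - q^{2\tilde n}) = (1 - \rho)\bar q^{\,t} - q^{2\tilde n} + \rho\,\bar q^{\,t}q^{2\tilde n} \ge (1 - \rho)\bar q^{\,t} - q^{2\tilde n} \ge 0.$$ ■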
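Both parts of Lemma 9 lend themselves to a quick numerical check. The Python sketch below evaluates the slack in (5.44) and, for part b), picks the smallest integer $\tilde n$ satisfying the hypothesis before evaluating the conclusion. The helper names and sample inputs are illustrative assumptions.

```python
import math


def lemma9a_slack(x):
    """log(1 - 1/x) + 1/(x - 1) for x > 1; (5.44) states that this is nonnegative."""
    return math.log(1.0 - 1.0 / x) + 1.0 / (x - 1.0)


def smallest_admissible_n(rho, q, q_bar, t):
    """Smallest integer n with n >= (t log q_bar + log(1 - rho)) / (2 log q)."""
    bound = (t * math.log(q_bar) + math.log(1.0 - rho)) / (2.0 * math.log(q))
    return max(0, math.ceil(bound))


def lemma9b_slack(rho, q, q_bar, t):
    """q_bar^t - q^(2n) - rho * q_bar^t * (1 - q^(2n)) at the smallest admissible n."""
    n = smallest_admissible_n(rho, q, q_bar, t)
    return q_bar ** t - q ** (2 * n) - rho * q_bar ** t * (1.0 - q ** (2 * n))
```

Since both logarithms in the hypothesis of part b) are negative, the bound on $\tilde n$ is positive, and any larger $\tilde n$ only increases the slack.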


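We are now ready to prove Theorem 3.

Proof of Theorem 3. Without loss of generality, we may assume that the initial point $x_i^0 = 0$, $i = 1, \ldots, m$. Indeed,
the incremental gradient methods described in Subsection 3.3 are invariant with respect to a simultaneous shift
of the decision variables. In other words, the sequence of iterates $\{x^k\}$, which is generated by such a method
for minimizing the function $\Psi(x)$ starting from $x^0$, is just a shift of the sequence generated for minimizing
$\bar\Psi(x) = \Psi(x + x^0)$ starting from the origin.

Now let $k_i$, $i = 1, \ldots, m$, denote the number of times that the gradients of the component function $f_i$ are
computed from iteration 1 to $k$. Clearly the $k_i$'s are binomial random variables supported on $\{0, 1, \ldots, k\}$ such that
$\sum_{i=1}^m k_i = k$. Also observe that we must have $x^k_{i,j} = 0$ for any $k \ge 0$ and $k_i + 1 \le j \le \tilde n$, because each time the
gradient $\nabla f_i$ is computed, the incremental gradient methods add at most one more nonzero entry to the $i$-th
component of $x^k$ due to the structure of the gradient $\nabla f_i$. Therefore, we have

$$\frac{\|x^k - x^*\|^2}{\|x^0 - x^*\|^2} \ge \frac{\sum_{i=1}^m\sum_{j=k_i+1}^{\tilde n}(x^*_{i,j})^2}{\sum_{i=1}^m\sum_{j=1}^{\tilde n}(x^*_{i,j})^2} = \frac{\sum_{i=1}^m (q^{2k_i} - q^{2\tilde n})}{m(1 - q^{2\tilde n})}. \tag{5.45}$$

Observing that for any $i = 1, \ldots, m$,

$$\mathbb{E}[q^{2k_i}] = \sum_{t=0}^k q^{2t}\binom{k}{t}p_i^t(1 - p_i)^{k-t} = [1 - (1 - q^2)p_i]^k,$$

we then conclude from (5.45) that

$$\frac{\mathbb{E}[\|x^k - x^*\|^2]}{\|x^0 - x^*\|^2} \ge \frac{\sum_{i=1}^m [1 - (1 - q^2)p_i]^k - m q^{2\tilde n}}{m(1 - q^{2\tilde n})}.$$

Noting that $[1 - (1 - q^2)p_i]^k$ is convex w.r.t. $p_i$ for any $p_i \in [0, 1]$ and $k \ge 1$, by minimizing the RHS of the above
bound w.r.t. $p_i$, $i = 1, \ldots, m$, subject to $\sum_{i=1}^m p_i = 1$ and $p_i \ge 0$, we conclude that

$$\frac{\mathbb{E}[\|x^k - x^*\|^2]}{\|x^0 - x^*\|^2} \ge \frac{[1 - (1 - q^2)/m]^k - q^{2\tilde n}}{1 - q^{2\tilde n}} \ge \tfrac12[1 - (1 - q^2)/m]^k \tag{5.46}$$

for any $n \ge n(m, k)$ (see (3.44)) and possible selection of $p_i$, $i = 1, \ldots, m$, satisfying (3.40), where the last inequality
follows from Lemma 9.b). Noting that

$$1 - (1 - q^2)/m = 1 - \frac{1}{m}\Big[1 - \Big(\frac{\sqrt{\mathcal{Q}} - 1}{\sqrt{\mathcal{Q}} + 1}\Big)^2\Big] = 1 - \frac{1}{m} + \frac{1}{m}\Big(1 - \frac{2}{\sqrt{\mathcal{Q}} + 1}\Big)^2 = 1 - \frac{4}{m(\sqrt{\mathcal{Q}} + 1)} + \frac{4}{m(\sqrt{\mathcal{Q}} + 1)^2} = 1 - \frac{4\sqrt{\mathcal{Q}}}{m(\sqrt{\mathcal{Q}} + 1)^2},$$

we then conclude from (5.46) and Lemma 9.a) that

$$\frac{\mathbb{E}[\|x^k - x^*\|^2]}{\|x^0 - x^*\|^2} \ge \frac12\Big[1 - \frac{4\sqrt{\mathcal{Q}}}{m(\sqrt{\mathcal{Q}} + 1)^2}\Big]^k \ge \frac12\exp\Big(-\frac{4k\sqrt{\mathcal{Q}}}{m(\sqrt{\mathcal{Q}} + 1)^2 - 4\sqrt{\mathcal{Q}}}\Big).$$ ■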
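Two computational steps of the argument above are easy to confirm numerically: the binomial moment identity $\mathbb{E}[q^{2k_i}] = [1 - (1 - q^2)p_i]^k$ and the final passage from the power bound to the exponential bound via (5.44). The Python sketch below checks both; the function names and the sample parameters are assumptions made for illustration, and $\mathcal{Q} > 1$ is passed as the plain argument `Q`.

```python
import math


def expected_q_power(q, p, k):
    """E[q^(2K)] for K ~ Binomial(k, p), computed by direct summation."""
    return sum(
        q ** (2 * t) * math.comb(k, t) * p ** t * (1.0 - p) ** (k - t)
        for t in range(k + 1)
    )


def closed_form_q_power(q, p, k):
    """Closed form [1 - (1 - q^2) * p]^k of the same expectation."""
    return (1.0 - (1.0 - q * q) * p) ** k


def lower_bound_pair(Q, m, k):
    """Return (power bound, exponential bound) from the last display:
    0.5 * [1 - 4 sqrt(Q) / (m (sqrt(Q) + 1)^2)]^k and
    0.5 * exp(-4 k sqrt(Q) / (m (sqrt(Q) + 1)^2 - 4 sqrt(Q)))."""
    s = math.sqrt(Q)
    denom = m * (s + 1.0) ** 2
    power_bound = 0.5 * (1.0 - 4.0 * s / denom) ** k
    exp_bound = 0.5 * math.exp(-4.0 * k * s / (denom - 4.0 * s))
    return power_bound, exp_bound
```

The first returned value should never be smaller than the second, which is exactly inequality (5.44) applied with $x = m(\sqrt{\mathcal{Q}} + 1)^2/(4\sqrt{\mathcal{Q}})$.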
6 Concluding remarks

In this paper, we present a new class of optimal first-order methods, referred to as primal-dual gradient methods,
for solving the finite-sum composite convex optimization problems given in the form of (1.1). The optimal
convergence of this algorithm has been established based on the primal-dual optimality gap for the ergodic mean
of iterates, i.e., $\bar z^k$, and the distance from the iterate $x^k$ to the optimal solution $x^*$. We also develop a randomized
primal-dual gradient method which needs to compute the gradient of only one randomly selected component $f_i$.
The complexity bounds of the randomized primal-dual gradient method have been established in terms of the
distance from the iterate $x^k$ to the optimal solution, and the primal optimality gap based on the ergodic mean
of iterates, i.e., $\mathbb{E}[\Psi(\bar x^k) - \Psi(x^*)]$. We show that these bounds are not improvable when the dimension $n$ is large
enough by developing new lower complexity bounds for randomized incremental gradient methods. Extensions
of the randomized primal-dual gradient method to non-strongly convex, nonsmooth, and unbounded problems
are also discussed in this paper. It should be noted that in this paper we focus on the theoretical convergence
properties of these primal-dual gradient methods, and the algorithmic parameters were chosen in a conservative
manner and were dependent on a few problem parameters, e.g., $L$ and $\mu$. In the future, it will be interesting to
develop more adaptive versions of these algorithms which do not require the explicit estimation about L and 